Interesting approach! I’ve been building something complementary on the deterministic side.
LLM-as-judge guardrails are fundamentally probabilistic and can be gamed or hallucinate themselves (as several comments pointed out).
That’s why I built EvalView — it does full trajectory snapshots + diffs so you can see exactly what changed, plus a lightweight zero-judge model-check that directly pings the model and reports drift level (NONE / WEAK / MEDIUM / STRONG).
Gives you deterministic regression detection that works alongside (or instead of) LLM judges.
https://github.com/hidai25/eval-view
Curious how you handle drift detection in CrabTrap.
// The policy is embedded as a JSON-escaped value inside a structured JSON object.
// This prevents prompt injection via policy content — any special characters,
// delimiters, or instruction-like text in the policy are safely escaped by
// json.Marshal rather than concatenated as raw text.
yakkomajuri
Really cool! I'm also building something in this space but taking a slightly different approach. I'm glad to see more focus on security for production agentic workflows though, as I think we don't talk about it enough when it comes to claws and other autonomous agents.
I think you're spot on that so far it's been all or nothing. Either you give an agent a lot of access and it's really powerful but proportionally dangerous, or you lock it down so much that it's no longer useful.
I like a lot of the ideas you show here, but I also worry that LLM-as-a-judge is fundamentally a probabilistic guardrail and therefore inherently limited. How do you see this? It feels dangerous to rely on a security system based not on hard limits but on probabilities.
roywiggins
It's all fine until OpenClaw decides to start prompt injecting the judge
ArielTM
The debate here is missing a practical question: is the judge from the same model family as the agent it's judging?
If both are Claude, you have shared-vulnerability risk. Prompt-injection patterns that work against one often work against the other. Basic defense in depth says they should at least be different providers, ideally different architectures.
Secondary issue: the judge only sees what's in the HTTP body. Someone who can shape the request (via agent input) can shape the judge's context window too. That's a different failure mode than "judge gets tricked by clever prompting." It's "judge is starved of the signals it would need to spot the trick."
fareesh
Needs to be deterministic. ACLs
cadamsdotcom
> pointing it at a few days of real traffic produced policies that matched human judgment on the vast majority of held-out requests.
The problem is, 99% secure is a failing grade.
foreman_
The thread has converged on “LLM-as-judge is the wrong security primitive,” which is right as far as it goes. The prompt-injection chain ends at the outbound POST. By the time the judge sees the request, the credential has already been read.
The question edf13 pointed at but didn't develop: where does a transport-layer judge earn its place at all? Not as the enforcement layer, but as the audit layer on top of one. Kernel-level controls tell you what the agent did. A proxy tells you what the agent tried to exfiltrate, and where to.
Structured-JSON escaping and header caps are good tools for the detection job. They’re the wrong tools for the prevention job. Different layers, different questions.
Seventeen18
So cool! I'm building something very close to this but from another perspective; making it open source is giving me many ideas!
qwertyuiop_
Non-deterministic business rules engine.
IntrepidPig
Blatant “astroturfing” in these comments
DANmode
We’re supposed to be fixing LLM security by adding a non-LLM layer to it,
not adding LLM layers to stuff to make them inherently less secure.
This will be a neat concept for the types of tools that come after the present iteration of LLMs.
Comments like this don't fill me with confidence: https://github.com/brexhq/CrabTrap/blob/4fbbda9ca00055c1554a...
Unless I’m sorely mistaken.