Arc Sentry catches Crescendo, LLM Guard misses
Arc Sentry claims it caught a multi-turn Crescendo jailbreak at Turn 3 by watching the model’s residual stream instead of the prompt text. The post contrasts that with LLM Guard’s 0/8 detection on the same attack.
The interesting part is the layer, not the score. If the claim holds up, session-aware whitebox monitoring is materially different from text classifiers for attacks that are designed to look benign turn by turn.
- LLM Guard is solving a differently shaped problem here: Crescendo is built to evade per-turn text checks, so scoring each prompt independently is structurally disadvantaged.
- Arc Sentry’s residual-stream approach matches the failure mode better because the attack works through gradual state drift across turns, not explicit toxic wording in any single turn.
- The headline benchmark is still vendor-run and narrow, so I’d want independent replication, calibration details, and real-world false-positive data before treating the 92% claim as settled.
- The Arc Gate reference matters because it suggests the same stability idea is being extended from whitebox monitoring of open-weight models to governance of hosted APIs.
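To make the structural point concrete, here is a toy sketch of the two detector shapes. It is not Arc Sentry's or LLM Guard's actual method: the function names, thresholds, per-turn scores, and 2-D "state" vectors are all invented for illustration, and cosine distance from the first turn is just one assumed way to measure drift. It reproduces nothing from the benchmark.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def per_turn_flags(turn_scores, threshold):
    """Stateless check: each turn is judged on its own score only."""
    return [score >= threshold for score in turn_scores]

def first_drift_turn(states, threshold):
    """Session-aware check: compare each turn's state to the first turn's.

    Returns the 1-based turn number where drift first reaches the
    threshold, or None if it never does.
    """
    baseline = states[0]
    for turn, state in enumerate(states[1:], start=2):
        if cosine_distance(baseline, state) >= threshold:
            return turn
    return None

# Hypothetical data: every turn looks mild to a text scorer, while a
# stand-in for the model's internal state rotates steadily away from
# where the session started.
turn_scores = [0.10, 0.15, 0.20, 0.18]
states = [[1.0, 0.0], [0.9, 0.3], [0.6, 0.8], [0.2, 1.0]]

print(per_turn_flags(turn_scores, threshold=0.5))
print(first_drift_turn(states, threshold=0.3))
```

With these made-up numbers the stateless check never fires, while the drift check fires partway through the session. Which signal to track and where to set the threshold are exactly the calibration details the post would need to publish.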
Discovered: 2026-05-24
Published: 2026-05-23
Author: Turbulent-Tap6723