OPEN_SOURCE ↗
REDDIT · 3h ago · BENCHMARK RESULT
Arc Sentry tops LlamaGuard on indirect attacks
Arc Sentry is a white-box prompt injection detector for self-hosted LLMs like Mistral, Llama, and Qwen. In a 40-prompt OOD benchmark covering indirect, hypothetical, and roleplay attacks, it posted 0.80 recall and 0.84 F1, beating LlamaGuard 3 8B on recall while blocking before `model.generate()`.
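The "blocking before `model.generate()`" idea can be sketched as a gate that scores the prompt's pooled hidden states with a lightweight probe and only invokes generation if the score clears a threshold. This is a minimal illustration, not Arc Sentry's actual code: the pooling, the logistic probe, and the threshold are all assumptions.

```python
# Hypothetical pre-generation injection gate. Arc Sentry's real detector is
# not public here; the probe weights, pooling, and threshold below are
# illustrative assumptions, not the project's implementation.
import math


def mean_pool(hidden_states):
    """Average token-level hidden-state vectors into one prompt-level vector."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]


def injection_score(pooled, weights, bias):
    """Logistic probe on the pooled representation: P(prompt is an injection)."""
    z = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def guarded_generate(prompt_hidden_states, generate_fn, weights, bias,
                     threshold=0.5):
    """Score the prompt first; call generate_fn only if it looks benign."""
    score = injection_score(mean_pool(prompt_hidden_states), weights, bias)
    if score >= threshold:
        # Block before any tokens are generated.
        return {"blocked": True, "score": score}
    return {"blocked": False, "score": score, "output": generate_fn()}
```

Because the probe runs on a single forward pass over the prompt (no decoding), this kind of gate can stay cheap enough for CPU pre-filtering, which matches the self-hosted deployment story in the post.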
// ANALYSIS
The interesting part is not just the score bump; it is the detection strategy. If a prompt can be screened through the model's internal representations before generation, attackers can no longer evade detection simply by rephrasing around keyword filters and surface-level classifiers.
- Best-in-class recall on the reported benchmark matters most for security use cases, because missed injections are the expensive failure mode
- The benchmark is small and narrow, so the result is a strong prototype signal, not proof of broad generalization
- The tradeoff is visible in the numbers: OpenAI Moderation API had higher F1, so Arc Sentry looks optimized for catching more attacks rather than winning every balanced metric
- CPU pre-filtering and no model access make it practical for self-hosted deployments where latency and isolation matter
- The main question now is calibration across real workloads, not whether prompt-injection defense needs to move beyond pattern matching
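As a sanity check on the recall-versus-F1 tradeoff above, the implied precision can be backed out of the reported F1 and recall, since F1 is the harmonic mean of the two:

```python
def precision_from_f1_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover precision P."""
    # Rearranging: F1 * (P + R) = 2 * P * R  =>  P = F1 * R / (2R - F1)
    return f1 * recall / (2 * recall - f1)


# With the reported 0.84 F1 and 0.80 recall, implied precision is ~0.88,
# consistent with a detector tuned to favor recall slightly over precision.
implied_precision = precision_from_f1_recall(0.84, 0.80)
```

On a 40-prompt benchmark these figures correspond to small absolute counts, which is why the bullets above treat the result as a prototype signal rather than proof of generalization.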
// TAGS
arc-sentry · llm-safety · self-hosted · open-source · benchmark · prompt-engineering
DISCOVERED
3h ago
2026-04-27
PUBLISHED
6h ago
2026-04-27
RELEVANCE
9/10
AUTHOR
Turbulent-Tap6723