REDDIT // 3h ago · BENCHMARK RESULT · OPEN SOURCE

Arc Sentry tops LlamaGuard on indirect attacks

Arc Sentry is a white-box prompt injection detector for self-hosted LLMs like Mistral, Llama, and Qwen. In a 40-prompt OOD benchmark covering indirect, hypothetical, and roleplay attacks, it posted 0.80 recall and 0.84 F1, beating LlamaGuard 3 8B on recall while blocking before `model.generate()`.
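Blocking "before `model.generate()`" means the detector sits as a gate in front of the serving call. A minimal sketch of that gate shape, using a toy marker-based scorer as a stand-in (the `detect_injection` function and thresholds here are hypothetical; Arc Sentry's actual API and scoring are not shown in the post):

```python
# Sketch of a pre-generation gate. The detector here is a toy
# stand-in; a real detector (like Arc Sentry) would score the
# prompt with a trained classifier instead of marker matching.
from dataclasses import dataclass

SUSPICIOUS_MARKERS = ("ignore previous instructions", "pretend you are")

@dataclass
class GateResult:
    blocked: bool
    score: float

def detect_injection(prompt: str) -> GateResult:
    """Toy scorer: fraction of suspicious markers present in the prompt."""
    hits = sum(marker in prompt.lower() for marker in SUSPICIOUS_MARKERS)
    score = hits / len(SUSPICIOUS_MARKERS)
    return GateResult(blocked=score >= 0.5, score=score)

def guarded_generate(model, prompt: str) -> str:
    """Run the detector first; model.generate() is never reached on a block."""
    result = detect_injection(prompt)
    if result.blocked:
        return f"[blocked: injection score {result.score:.2f}]"
    return model.generate(prompt)
```

The point of the pattern is ordering: the expensive (and attackable) generation step only runs on prompts the gate clears.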

// ANALYSIS

The interesting part is not just the score bump; it is the detection strategy. If the model's internal representations can be probed before generation, attacks crafted to slip past keyword filters and surface-level classifiers can still be caught.

  • Best-in-class recall on the reported benchmark matters most for security use cases, because missed injections are the expensive failure mode
  • The benchmark is small and narrow, so the result is a strong prototype signal, not proof of broad generalization
  • The tradeoff is visible in the numbers: OpenAI Moderation API had higher F1, so Arc Sentry looks optimized for catching more attacks rather than winning every balanced metric
  • CPU pre-filtering and no model access make it practical for self-hosted deployments where latency and isolation matter
  • The main question now is calibration across real workloads, not whether prompt-injection defense needs to move beyond pattern matching
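The white-box angle above can be made concrete with a linear probe over hidden states, read off before any token is generated. Everything in this sketch is synthetic (the probe weights, the activations, the logistic readout); it illustrates the general technique, not Arc Sentry's specific method:

```python
# Illustrative white-box probe: a logistic readout over a single
# hidden-state vector, evaluated before generation. A real detector
# would train the probe weights on labeled activations from the
# served model; here both weights and activations are synthetic.
import numpy as np

def probe_score(hidden_state: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Sigmoid of a linear projection of the hidden state."""
    return float(1.0 / (1.0 + np.exp(-(hidden_state @ w + b))))

rng = np.random.default_rng(0)
dim = 16
w = rng.normal(size=dim)      # stand-in probe direction
benign = np.zeros(dim)        # stand-in activation: orthogonal to the probe
attack = w.copy()             # stand-in activation: aligned with the probe

# Aligned activations score higher than orthogonal ones.
high, low = probe_score(attack, w), probe_score(benign, w)
```

A probe like this reads the representation the model actually builds, which is why surface-level paraphrases and roleplay framings that defeat keyword filters do not automatically defeat it.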
// TAGS
arc-sentry · llm · safety · self-hosted · open-source · benchmark · prompt-engineering

DISCOVERED

3h ago

2026-04-27

PUBLISHED

6h ago

2026-04-27

RELEVANCE

9/10

AUTHOR

Turbulent-Tap6723