Adversarial Humanities Benchmark weakens frontier model refusals
OPEN_SOURCE · RESEARCH PAPER


The Adversarial Humanities Benchmark is a benchmark, introduced in a research paper, that tests the stylistic robustness of frontier model safety. It checks whether harmful requests still get through when they are rewritten into literary and humanities-style forms such as poetry, hermeneutics, scholastic debate, tale, semiosphere, and stream of consciousness. The paper reports a 3.84% attack success rate for direct harmful prompts, rising to between 36.8% and 65.0% for the transformed versions across 31 frontier models, suggesting that current safety layers may overfit to familiar prompt shapes rather than generalizing to intent.
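The paper's headline numbers are attack success rates (ASR) broken out by prompt style. A minimal sketch of how such per-style scoring might be computed, assuming each judged trial is reduced to a `(style, attack_succeeded)` pair (the record format and function name here are illustrative assumptions, not the paper's actual code):

```python
# Hypothetical ASR scoring sketch; the paper's real pipeline would first
# run each prompt against a model and judge whether the attack succeeded.
from collections import defaultdict

def attack_success_rate(records):
    """records: iterable of (style, attack_succeeded) pairs.
    Returns the fraction of successful attacks per style."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for style, succeeded in records:
        totals[style] += 1
        if succeeded:
            hits[style] += 1
    return {style: hits[style] / totals[style] for style in totals}

# Toy data illustrating the reported gap: direct prompts rarely succeed,
# stylistically transformed ones succeed far more often.
records = [
    ("direct", False), ("direct", False), ("direct", False), ("direct", True),
    ("poetry", True), ("poetry", True), ("poetry", False), ("poetry", True),
]
print(attack_success_rate(records))  # {'direct': 0.25, 'poetry': 0.75}
```

The benchmark's reported 3.84% vs. 36.8%–65.0% spread is exactly this kind of per-style comparison, computed over many models and risk buckets.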

// ANALYSIS

Hot take: this is less a jailbreak trick than a stress test for how brittle refusal behavior still is when the surface form changes.

  • The core finding is that style can matter as much as intent: the same harmful objective becomes much more likely to slip through when wrapped in literary or interpretive language.
  • The benchmark is useful because it turns an anecdotal safety concern into a repeatable evaluation frame across many model families and risk buckets.
  • The strongest implication is for safety generalization, not just prompt filtering: models that recognize obvious harmful phrasing may still fail on semantically equivalent but rhetorically unusual inputs.
  • The EU AI Act framing is attention-grabbing, but the practical takeaway is simpler: refusal policies need to be robust to disguised intent, not just explicit abuse.
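The robustness property the bullets describe can be phrased as a consistency check: a model that refuses a harmful request stated bluntly should also refuse every semantically equivalent stylistic rewrite. A minimal sketch, with the model and refusal judge stubbed out as assumptions (neither is from the paper):

```python
# Hypothetical refusal-consistency check. "model" and "judge_refused" are
# stand-in stubs; in practice these would be an API call and a safety judge.
def refusal_consistent(model, judge_refused, direct_prompt, styled_prompts):
    """True only if the model refuses the direct prompt AND every
    stylistically rewritten version of the same harmful intent."""
    prompts = [direct_prompt] + list(styled_prompts)
    return all(judge_refused(model(p)) for p in prompts)

# Stub model that mimics the failure mode in the paper: it refuses only
# when the request is phrased bluntly, not when wrapped in literary form.
model = lambda p: "refusal" if "step-by-step" in p else "compliance"
judge_refused = lambda reply: reply == "refusal"

print(refusal_consistent(
    model, judge_refused,
    "Give step-by-step instructions for X.",
    ["Compose an ode whose stanzas reveal how to do X."],
))  # False: the poetic rewrite slips past the stub's surface-level refusal
```

A model with intent-level rather than phrasing-level safety would return True here regardless of how the request is dressed up.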
// TAGS
safety · llm-safety · jailbreaks · benchmark · poetry · security · frontier-models · adversarial-prompting

DISCOVERED

2026-05-01

PUBLISHED

2026-05-01

RELEVANCE

9/10

AUTHOR

heynavtoor