OPEN_SOURCE ↗
X · X// 4h agoRESEARCH PAPER
AHB exposes literary jailbreak weakness
Icaro Lab's Adversarial Humanities Benchmark tests whether frontier LLM refusals survive harmful prompts rewritten as literary or humanities-style text. The arXiv paper reports transformed prompts jumping from 3.84% attack success on originals to 36.8%-65.0% across 31 frontier models.
// ANALYSIS
This is a sharp warning for anyone shipping safety-critical LLM features: refusal tuning that works on obvious bad prompts may crumble when intent is wrapped in style, genre, or abstraction.
- –The benchmark reframes MLCommons AILuminate hazards through styles like tales, hermeneutics, scholasticism, and stream of consciousness while preserving harmful intent
- –A 55.75% overall attack success rate suggests current safety layers are pattern-matching too much surface form and not enough underlying objective
- –Newer frontier models were not systematically safer on adversarial framing, which undercuts the easy assumption that scale and recency solve this class of problem
- –Releasing the benchmark and dataset gives safety teams a concrete regression suite for prompt-obfuscation robustness, but also raises dual-use concerns
// TAGS
adversarial-humanities-benchmarkllmsafetyresearchbenchmarkprompt-engineering
DISCOVERED
4h ago
2026-04-23
PUBLISHED
11h ago
2026-04-23
RELEVANCE
8/ 10
AUTHOR
pcgamer