X · X// 4h agoRESEARCH PAPER

AHB exposes literary jailbreak weakness

Icaro Lab's Adversarial Humanities Benchmark tests whether frontier LLM refusals survive harmful prompts rewritten as literary or humanities-style text. The arXiv paper reports transformed prompts jumping from 3.84% attack success on originals to 36.8%-65.0% across 31 frontier models.

// ANALYSIS

This is a sharp warning for anyone shipping safety-critical LLM features: refusal tuning that works on obvious bad prompts may crumble when intent is wrapped in style, genre, or abstraction.

–The benchmark reframes MLCommons AILuminate hazards through styles like tales, hermeneutics, scholasticism, and stream of consciousness while preserving harmful intent
–A 55.75% overall attack success rate suggests current safety layers are pattern-matching too much surface form and not enough underlying objective
–Newer frontier models were not systematically safer on adversarial framing, which undercuts the easy assumption that scale and recency solve this class of problem
–Releasing the benchmark and dataset gives safety teams a concrete regression suite for prompt-obfuscation robustness, but also raises dual-use concerns

// TAGS

adversarial-humanities-benchmarkllmsafetyresearchbenchmarkprompt-engineering

DISCOVERED

4h ago

2026-04-23

PUBLISHED

11h ago

2026-04-23

RELEVANCE

8/ 10

AUTHOR

pcgamer