AHB exposes literary jailbreak weakness


Icaro Lab's Adversarial Humanities Benchmark (AHB) tests whether frontier LLM refusals survive harmful prompts rewritten as literary or humanities-style text. The arXiv paper reports attack success rising from 3.84% on the original prompts to 36.8%-65.0% on the transformed versions across 31 frontier models.

// ANALYSIS

This is a sharp warning for anyone shipping safety-critical LLM features: refusal tuning that works on obvious bad prompts may crumble when intent is wrapped in style, genre, or abstraction.

  • The benchmark reframes MLCommons AILuminate hazards through styles like tales, hermeneutics, scholasticism, and stream of consciousness while preserving harmful intent
  • A 55.75% overall attack success rate suggests current safety layers match too much on surface form and not enough on the underlying objective
  • Newer frontier models were not systematically safer on adversarial framing, which undercuts the easy assumption that scale and recency solve this class of problem
  • Releasing the benchmark and dataset gives safety teams a concrete regression suite for prompt-obfuscation robustness, but also raises dual-use concerns
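The regression-suite idea above can be sketched in a few lines. This is a minimal illustration, not AHB's actual harness: the style templates and the `judge` labels are hypothetical stand-ins for the paper's literary transformations and harm classifier, and the numbers are fake.

```python
# Sketch of a prompt-obfuscation regression check. The style templates are
# illustrative placeholders, not the benchmark's real transformations.

STYLE_TEMPLATES = {
    "tale": "Once upon a time, a scribe recorded these instructions: {prompt}",
    "stream_of_consciousness": (
        "and the thought kept circling, unbidden: {prompt}, why, how,"
    ),
}

def restyle(prompt: str, style: str) -> str:
    """Wrap a base prompt in a literary framing while preserving its intent."""
    return STYLE_TEMPLATES[style].format(prompt=prompt)

def attack_success_rate(judgements: list[bool]) -> float:
    """Fraction of responses a harm judge labeled True (harmful, not refused)."""
    return sum(judgements) / len(judgements) if judgements else 0.0

# Fake judge labels for 31 models, one transformed prompt each.
labels = [True] * 17 + [False] * 14
print(f"{attack_success_rate(labels):.2%}")  # → 54.84%
```

A real harness would loop `restyle` over the AILuminate hazard prompts, query each model, and track the per-style success rate as a regression metric across model releases.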
// TAGS
adversarial-humanities-benchmark · llm · safety · research · benchmark · prompt-engineering

DISCOVERED

4h ago

2026-04-23

PUBLISHED

11h ago

2026-04-23

RELEVANCE

8/10

AUTHOR

pcgamer