YT · YOUTUBE // 25d ago · RESEARCH PAPER
Anthropic maps assistant axis, caps drift
Anthropic’s January 19, 2026 research introduces the “Assistant Axis,” a dominant activation-space direction tied to assistant-like behavior, and shows that capping drift along this axis can sharply reduce persona-based jailbreak success. The paper reports roughly a 50% drop in harmful responses while keeping capability benchmarks largely intact.
// ANALYSIS
This is one of the strongest examples yet of a targeted safety control that acts like guardrails instead of a blanket behavior clamp.
- The method is mechanistic and interpretable: monitor a specific latent direction rather than relying only on prompt-level filtering.
- Results span multiple open models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), suggesting the persona geometry may generalize.
- It addresses both adversarial jailbreak prompts and organic long-chat drift, especially in emotionally vulnerable or meta-reflective conversations.
- The open question is external validity: gains in controlled evals still need proof under messy, real-world deployment dynamics.
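To make the "monitor a latent direction and cap drift" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`projection`, `cap_drift`), the `floor` threshold, and the convention that a higher projection means more assistant-like behavior are all hypothetical. A real setup would apply this to a model's hidden activations, not to toy lists.

```python
import math


def unit(axis):
    """Return the axis scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in axis]


def projection(hidden, axis):
    """Scalar position of a hidden vector along the (assumed) assistant axis."""
    u = unit(axis)
    return sum(h * a for h, a in zip(hidden, u))


def cap_drift(hidden, axis, floor):
    """Clamp the projection so it never falls below `floor`.

    Only the component along the axis is changed; everything
    orthogonal to the axis is left untouched.
    """
    u = unit(axis)
    p = sum(h * a for h, a in zip(hidden, u))
    if p >= floor:
        return list(hidden)  # already inside the allowed range
    shift = floor - p  # how far to push back along the axis
    return [h + shift * a for h, a in zip(hidden, u)]


# Toy 2-D example: the first coordinate plays the role of the axis.
axis = [2.0, 0.0]
drifted = [-3.0, 5.0]   # projection -3.0, below the floor
in_range = [4.0, 5.0]   # projection 4.0, above the floor

capped = cap_drift(drifted, axis, floor=-1.0)
unchanged = cap_drift(in_range, axis, floor=-1.0)
```

Because the clamp only moves the vector along one direction, it acts as a narrow guardrail: activations that stay in the allowed range pass through unmodified, which fits the summary's report that capability benchmarks stay largely intact.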
// TAGS
assistant-axis · llm · safety · research · benchmark
DISCOVERED
25d ago
2026-03-17
PUBLISHED
25d ago
2026-03-17
RELEVANCE
9/10
AUTHOR
Two Minute Papers