Anthropic maps assistant axis, caps drift
YT · YOUTUBE // 25d ago // RESEARCH PAPER

Anthropic’s research, published January 19, 2026, introduces the “Assistant Axis,” a dominant direction in activation space tied to assistant-like behavior, and shows that capping drift along this axis can sharply reduce the success of persona-based jailbreaks. The paper reports roughly a 50% drop in harmful responses while keeping capability benchmarks largely intact.

// ANALYSIS

This is one of the strongest examples yet of a targeted safety control that works like a guardrail rather than a blanket clamp on behavior.

  • The method is mechanistic and interpretable: monitor a specific latent direction rather than relying only on prompt-level filtering.
  • Results span multiple open models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), suggesting the persona geometry may generalize.
  • It addresses both adversarial jailbreak prompts and organic long-chat drift, especially in emotionally vulnerable or meta-reflective conversations.
  • The open question is external validity: gains in controlled evals still need proof under messy, real-world deployment dynamics.
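
To make the "monitor a latent direction, then cap drift" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy 2-D vectors, and the bounds `lower` and `upper` are all hypothetical, and a real system would apply this to model hidden states at chosen layers.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]

def project(hidden, unit_axis):
    """Monitoring step: scalar coordinate of `hidden` along `unit_axis`."""
    return dot(hidden, unit_axis)

def cap_along_axis(hidden, unit_axis, lower, upper):
    """Capping step: clamp the coordinate along the axis to [lower, upper].

    Only the component along `unit_axis` is changed; the orthogonal
    part of `hidden` is left as it was.
    """
    coord = project(hidden, unit_axis)
    capped = min(max(coord, lower), upper)
    shift = capped - coord
    return [h + shift * u for h, u in zip(hidden, unit_axis)]

# Toy example (all values are made up for illustration).
axis = normalize([3.0, 4.0])   # hypothetical stand-in for an assistant-like direction
drifted = [-5.0, 0.0]          # hypothetical activation far from the usual range
stabilized = cap_along_axis(drifted, axis, lower=-1.0, upper=float("inf"))
```

In this toy setup the drifted vector starts with a coordinate of about -3.0 along the axis and is pulled back to the assumed floor of -1.0, while activations already inside the range pass through unchanged. That last property is what makes this kind of control targeted rather than a blanket clamp.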
// TAGS
assistant-axis, llm, safety, research, benchmark

DISCOVERED

25d ago

2026-03-17

PUBLISHED

25d ago

2026-03-17

RELEVANCE

9/10

AUTHOR

Two Minute Papers