Anthropic maps assistant axis, caps drift

// 117d ago · RESEARCH PAPER

Anthropic’s January 19, 2026 research introduces the “Assistant Axis,” a dominant activation-space direction tied to assistant-like behavior, and shows that capping drift along this axis can sharply reduce persona-based jailbreak success. The paper reports roughly a 50% drop in harmful responses while keeping capability benchmarks largely intact.

// ANALYSIS

This is one of the strongest examples yet of a targeted safety control that acts like guardrails instead of a blanket behavior clamp.

– The method is mechanistic and interpretable: monitor a specific latent direction rather than relying only on prompt-level filtering.
– Results span multiple open models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), suggesting the persona geometry may generalize.
– It addresses both adversarial jailbreak prompts and organic long-chat drift, especially in emotionally vulnerable or meta-reflective conversations.
– The open question is external validity: gains in controlled evals still need proof under messy, real-world deployment dynamics.
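
To make "monitor a latent direction and cap drift along it" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: this summary does not describe the exact procedure, so the function names (`unit`, `projection`, `cap_along_axis`), the toy two-dimensional vectors, and the `low`/`high` thresholds are all hypothetical. The idea shown is generic: measure how far an activation vector extends along a fixed axis, and if that value leaves an allowed range, shift the vector back along the axis only, leaving everything orthogonal to it untouched.

```python
import math
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Return v rescaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]


def projection(h: Sequence[float], axis: Sequence[float]) -> float:
    """Scalar projection of activation h onto the unit-normalised axis.

    Tracking this value turn by turn is one way to monitor drift.
    """
    u = unit(axis)
    return sum(a * b for a, b in zip(h, u))


def cap_along_axis(
    h: Sequence[float], axis: Sequence[float], low: float, high: float
) -> List[float]:
    """Clamp h's component along axis to [low, high].

    Only the component parallel to the axis changes; the orthogonal
    part of h is preserved. low/high are illustrative thresholds.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    u = unit(axis)
    p = sum(a * b for a, b in zip(h, u))
    target = min(max(p, low), high)  # clamp the scalar projection
    shift = target - p               # zero when p is already in range
    return [a + shift * b for a, b in zip(h, u)]


if __name__ == "__main__":
    axis = [3.0, 4.0]        # hypothetical stand-in for an "assistant" direction
    drifted = [1.0, -2.0]    # toy activation with a negative projection
    capped = cap_along_axis(drifted, axis, low=0.0, high=5.0)
    print(projection(drifted, axis), projection(capped, axis))
```

In a real model the same arithmetic would run on a layer's hidden states, with the axis and thresholds estimated from data. Because the shift is zero whenever the projection is already in range, ordinary activations pass through unchanged, which fits the "guardrails rather than blanket clamp" framing above.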

// TAGS

assistant-axis, llm, safety, research, benchmark

DISCOVERED

117d ago

2026-03-17

PUBLISHED

117d ago

2026-03-17

RELEVANCE

9/10

AUTHOR

Two Minute Papers
