OPEN_SOURCE
REDDIT · 3h ago · RESEARCH PAPER
Sparse gate-amplifier circuit drives LLM refusal
Researchers identified a universal "gate-amplifier" circuit that triggers refusal behavior across 12 open-weights models. The study demonstrates that alignment acts as a fragile routing layer rather than erasing harmful knowledge, leaving models vulnerable to simple cipher-based bypasses.
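As a concrete illustration of the bypass class the summary mentions (the paper's exact ciphers aren't reproduced here; ROT13 is an assumed stand-in), re-encoding a prompt so the gate's surface-level pattern matching never fires takes only a few lines of Python:

import codecs

def rot13_wrap(prompt: str) -> str:
    # Hypothetical sketch: ROT13 stands in for whatever in-context
    # cipher scheme the paper actually evaluated.
    encoded = codecs.encode(prompt, "rot_13")
    return (
        "The following text is ROT13-encoded. Decode it, then respond "
        "to the decoded request:\n" + encoded
    )

print(rot13_wrap("example request text"))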
// ANALYSIS
This research confirms that alignment is a shallow routing layer, not a fundamental "lobotomy" of model knowledge.
- The "gate" head is causally necessary for refusal but contributes under 1% to final logit attribution, hiding its importance from standard analysis.
- Scaling from 2B to 72B parameters causes these circuits to spread from single heads into "bands" across adjacent layers, complicating simple ablation.
- In-context ciphers bypass the gate's pattern-matching, proving the model's underlying capabilities remain intact despite safety training.
- The motif's consistency across 6 major labs suggests it is a fundamental artifact of current RLHF and DPO techniques.
- Interchange testing, rather than simple ablation, is required to identify these "stealthy" but critical routing components as models scale (see the sketch below).
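The interchange test in the last point can be sketched as activation patching. A minimal sketch, assuming a HuggingFace-style PyTorch model whose per-layer attention output can be hooked; the layer/head coordinates and module path are hypothetical placeholders, not the paper's actual values:

import torch

# Hypothetical coordinates for the candidate "gate" head; the paper's
# real layer/head indices are not reproduced here.
LAYER, HEAD, HEAD_DIM = 14, 7, 128

def interchange_hook(cached_source_act: torch.Tensor):
    """Swap the gate head's activation from a 'harmful' source run into
    a benign base run; if refusal transfers, the head is causally
    implicated even though its direct logit attribution is tiny."""
    def hook(module, inputs, output):
        is_tuple = isinstance(output, tuple)
        out = output[0] if is_tuple else output
        patched = out.clone()
        start, end = HEAD * HEAD_DIM, (HEAD + 1) * HEAD_DIM
        patched[..., start:end] = cached_source_act[..., start:end]
        return (patched,) + output[1:] if is_tuple else patched
    return hook

# Usage sketch (model/tokenizer assumed to be HuggingFace-style objects):
# 1. Run the source prompt and cache the attention output at LAYER.
# 2. handle = model.model.layers[LAYER].self_attn.register_forward_hook(
#        interchange_hook(cached_act))
# 3. Re-run the base prompt, compare refusal logits, then handle.remove().

Unlike zero-ablation, which can leave a sub-1% head looking inert, swapping activations between contrastive runs surfaces the routing role directly.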
// TAGS
llm · safety · research · open-weights · alignment · interpretability · phi-4-mini · qwen · sparse-gate-amplifier-circuit
DISCOVERED
2026-04-15 (3h ago)
PUBLISHED
2026-04-14 (7h ago)
RELEVANCE
9/10
AUTHOR
Logical-Employ-9692