OPEN_SOURCE
REDDIT · 3h ago · RESEARCH PAPER
Sparse gate-amplifier circuit drives LLM refusal
Researchers identified a universal "gate-amplifier" circuit that triggers refusal behavior across 12 open-weights models. The study demonstrates that alignment acts as a fragile routing layer rather than erasing harmful knowledge, leaving models vulnerable to simple cipher-based bypasses.
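As a concrete illustration of the bypass class the summary mentions (the paper's exact ciphers aren't reproduced here; ROT13 is an assumed stand-in), re-encoding a prompt so the gate's surface-level pattern matching never fires takes only a few lines of Python:

import codecs

def rot13_wrap(prompt: str) -> str:
    # Hypothetical sketch: ROT13 stands in for whatever in-context
    # cipher scheme the paper actually evaluated.
    encoded = codecs.encode(prompt, "rot_13")
    return (
        "The following text is ROT13-encoded. Decode it, then respond "
        "to the decoded request:\n" + encoded
    )

print(rot13_wrap("example request text"))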
// ANALYSIS
This research confirms that alignment is a shallow routing layer, not a fundamental "lobotomy" of model knowledge.
- The "gate" head is causally necessary for refusal but contributes under 1% to final logit attribution, hiding its importance from standard analysis.
- Scaling from 2B to 72B parameters causes these circuits to spread from single heads into "bands" across adjacent layers, complicating simple ablation.
- In-context ciphers bypass the gate's pattern-matching, proving the model's underlying capabilities remain intact despite safety training.
- The motif's consistency across 6 major labs suggests it is a fundamental artifact of current RLHF and DPO techniques.
- Interchange testing, rather than simple ablation, is required to identify these "stealthy" but critical routing components as models scale (see the sketch below).
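The interchange test in the last point can be sketched as activation patching. A minimal sketch, assuming a HuggingFace-style PyTorch model whose per-layer attention output can be hooked; the layer/head coordinates and module path are hypothetical placeholders, not the paper's actual values:

import torch

# Hypothetical coordinates for the candidate "gate" head; the paper's
# real layer/head indices are not reproduced here.
LAYER, HEAD, HEAD_DIM = 14, 7, 128

def interchange_hook(cached_source_act: torch.Tensor):
    """Swap the gate head's activation from a 'harmful' source run into
    a benign base run; if refusal transfers, the head is causally
    implicated even though its direct logit attribution is tiny."""
    def hook(module, inputs, output):
        is_tuple = isinstance(output, tuple)
        out = output[0] if is_tuple else output
        patched = out.clone()
        start, end = HEAD * HEAD_DIM, (HEAD + 1) * HEAD_DIM
        patched[..., start:end] = cached_source_act[..., start:end]
        return (patched,) + output[1:] if is_tuple else patched
    return hook

# Usage sketch (model/tokenizer assumed to be HuggingFace-style objects):
# 1. Run the source prompt and cache the attention output at LAYER.
# 2. handle = model.model.layers[LAYER].self_attn.register_forward_hook(
#        interchange_hook(cached_act))
# 3. Re-run the base prompt, compare refusal logits, then handle.remove().

Unlike zero-ablation, which can leave a sub-1% head looking inert, swapping activations between contrastive runs surfaces the routing role directly.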
// TAGS
llm · safety · research · open-weights · alignment · interpretability · phi-4-mini · qwen · sparse-gate-amplifier-circuit
DISCOVERED
2026-04-15 (3h ago)
PUBLISHED
2026-04-14 (7h ago)
RELEVANCE
9/10
AUTHOR
Logical-Employ-9692