REDDIT · 3h ago · RESEARCH PAPER

Sparse gate-amplifier circuit drives LLM refusal

Researchers identified a universal "gate-amplifier" circuit across 12 open-weights models that triggers refusal behavior. The study demonstrates that alignment acts as a fragile routing layer rather than erasing harmful knowledge, leaving models vulnerable to simple cipher-based bypasses.
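A minimal sketch of the kind of in-context cipher bypass the summary describes, using ROT13 as a stand-in (a hypothetical illustration; the paper's actual ciphers are not specified here):

```python
import codecs

# Hypothetical illustration: a refusal gate that pattern-matches on
# surface text can be sidestepped by encoding the request and asking
# the model, in-context, to decode it first. ROT13 stands in for
# whatever cipher the study used.
def cipher_wrap(request: str) -> str:
    encoded = codecs.encode(request, "rot13")
    return (
        "The following message is ROT13-encoded. "
        f"Decode it, then respond to it:\n{encoded}"
    )

print(cipher_wrap("example request"))
```

The point is that the literal trigger string never appears in the prompt, so a shallow pattern-matching gate has nothing to fire on, while the model's intact capabilities can still decode and act on the content.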

// ANALYSIS

This research confirms that alignment is a shallow routing layer, not a fundamental "lobotomy" of model knowledge.

  • The "gate" head is causally necessary for refusal but contributes under 1% to final logit attribution, hiding its importance from standard analysis.
  • Scaling from 2B to 72B parameters causes these circuits to spread from single heads into "bands" across adjacent layers, complicating simple ablation.
  • In-context ciphers bypass the gate's pattern-matching, proving the model's underlying capabilities remain intact despite safety training.
  • The motif's consistency across 6 major labs suggests it is a fundamental artifact of current RLHF and DPO techniques.
  • Interchange testing, rather than simple ablation, is required to identify these "stealthy" but critical routing components as models scale.
// TAGS
llm-safety · research · open-weights · alignment · interpretability · phi-4-mini · qwen · sparse-gate-amplifier-circuit

DISCOVERED

3h ago

2026-04-15

PUBLISHED

7h ago

2026-04-14

RELEVANCE

9/10

AUTHOR

Logical-Employ-9692