Nous Research releases Contrastive Neuron Attribution method
Nous Research has unveiled Contrastive Neuron Attribution (CNA), a mechanistic interpretability method that steers LLM behavior by isolating sparse circuits of fewer than 200 neurons. The technique enables precise suppression or amplification of specific behaviors, such as refusal, without degrading model coherence.
CNA moves model steering past noisy "activation additions" toward surgical manipulation of individual neurons. By touching only 0.1% of the model, it offers a way to "lobotomize" safety guardrails or enhance specific capabilities. The method identifies behavioral circuits such as the "refusal gate" using only forward passes, so it avoids the need for expensive sparse autoencoders. Its results indicate that alignment fine-tuning crystallizes discrimination features already present in base models rather than creating new structures. The toolkit is available as the neural-steering repository on GitHub and remains stable at high intervention strengths where previous methods caused model collapse.
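The blurb does not spell out the algorithm, so the following is only a minimal sketch of the general idea of contrastive attribution, not Nous Research's implementation or the neural-steering API. It assumes per-neuron activations have already been collected with forward passes on two contrasting prompt sets, ranks neurons by a standardized gap in mean activation, and then scales the selected neurons. The function names, the scoring rule, and the synthetic data are all hypothetical.

```python
import numpy as np


def contrastive_attribution(acts_pos, acts_neg, k):
    """Rank neurons by the standardized gap in mean activation.

    acts_pos, acts_neg: arrays of shape (n_prompts, n_neurons) holding
    activations recorded on two contrasting prompt sets (for example,
    prompts that trigger a behavior and prompts that do not).
    The scoring rule is an illustrative assumption, not the published one.
    """
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    pooled_std = np.sqrt(0.5 * (acts_pos.var(axis=0) + acts_neg.var(axis=0))) + 1e-8
    scores = diff / pooled_std
    # Keep the k neurons with the largest absolute score.
    top = np.argsort(-np.abs(scores))[:k]
    return top, scores


def steer(acts, neuron_idx, scale):
    """Return a copy of acts with the chosen neurons multiplied by scale.

    scale = 0.0 suppresses the neurons; scale > 1.0 amplifies them.
    """
    out = acts.copy()
    out[..., neuron_idx] *= scale
    return out


# Synthetic demo: plant a 3-neuron "circuit" in otherwise random activations.
rng = np.random.default_rng(0)
n_prompts, n_neurons = 64, 2048
planted = np.array([5, 100, 1500])

acts_neg = rng.normal(size=(n_prompts, n_neurons))
acts_pos = rng.normal(size=(n_prompts, n_neurons))
acts_pos[:, planted] += 4.0  # these neurons fire more on the "positive" set

circuit, scores = contrastive_attribution(acts_pos, acts_neg, k=3)
suppressed = steer(acts_pos, circuit, scale=0.0)

print("recovered circuit:", sorted(circuit.tolist()))
print("fraction of neurons touched:", len(circuit) / n_neurons)
```

In a real model, the activation matrices would come from hooks on the relevant layers, and the steering step would run inside the forward pass instead of on stored arrays. Only forward passes are needed in this sketch, which matches the property the summary highlights.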
Discovered: 2026-05-23
Published: 2026-05-23
Author: NousResearch
