OPEN_SOURCE
YT · YOUTUBE // RESEARCH PAPER
H-Neurons maps sparse LLM hallucination circuits
THUNLP’s H-Neurons paper and official code release argue that fewer than 0.1% of feedforward neurons can predict hallucinations across multiple LLMs and evaluation settings, including out-of-domain and fabricated-question cases. The GitHub repo ships the full pipeline for collecting responses, extracting answer tokens, training sparse classifiers, and testing neuron-level interventions.
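The sparse-classifier step of that pipeline can be sketched as an L1-regularized logistic probe over hidden activations. Everything below is a synthetic stand-in, not the paper's setup: the data is random, the "answer-token activations" are Gaussian noise, and the three planted signal neurons are arbitrary indices. The point is only that an L1 penalty zeroes out almost all weights, so the surviving nonzero weights name a tiny candidate neuron set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations at the answer token:
# 4096 "neurons", of which only 3 actually carry the label signal.
n_samples, n_neurons = 2000, 4096
signal_idx = [7, 1234, 4000]                          # hypothetical "H-neurons"
X = rng.normal(size=(n_samples, n_neurons))
y = (X[:, signal_idx].sum(axis=1) > 0).astype(float)  # 1 = "hallucinated"

# L1-regularized logistic probe trained by proximal gradient descent;
# the soft-threshold step drives most weights to exactly zero.
w = np.zeros(n_neurons)
lr, lam = 0.1, 0.01
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30.0, 30.0)))
    grad = X.T @ (p - y) / n_samples
    w -= lr * grad
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold

# The largest-magnitude weights recover the planted signal neurons.
top3 = sorted(int(i) for i in np.argsort(-np.abs(w))[:3])
print(top3)
```

On this toy data the probe concentrates its weight on the three planted neurons, which mirrors the paper's claim in miniature: if the signal really is carried by a tiny subset, a sparse linear probe can find it.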
// ANALYSIS
This is a strong mechanistic interpretability result, not proof that hallucinations are “solved.” What makes it notable is the claim that hallucination behavior is concentrated, measurable, and partially controllable at the neuron level rather than just a fuzzy system-level failure.
- The core result is sparsity: a tiny neuron subset beats random-neuron baselines on hallucination detection across TriviaQA, NQ-Open, BioASQ, and fabricated-entity prompts
- The intervention experiments tie these neurons to broader over-compliance behaviors, including accepting false premises, misleading context, sycophancy, and harmful instructions
- The transfer results suggest these circuits already exist in base models, which points the finger at pretraining rather than alignment alone
- The repo makes the work more useful for researchers because it exposes a concrete workflow for probing models instead of stopping at paper-level claims
- It still falls short of a deployable fix, since suppressing or amplifying neurons is delicate and can trade off against general model utility
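The last point can be made concrete with a toy feedforward block and a hand-rolled neuron-scaling hook. The shapes, indices, and activation function below are arbitrary illustrations, not the paper's intervention method: zeroing a few hidden neurons perturbs every output coordinate, which is exactly why suppression can also degrade unrelated capabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy transformer-style feedforward block: out = W_out @ act(W_in @ x).
d_model, d_ff = 16, 64
W_in = rng.normal(size=(d_ff, d_model))
W_out = rng.normal(size=(d_model, d_ff))
x = rng.normal(size=d_model)

def forward(x, scale=None):
    """Run the block; `scale` maps hidden-neuron index -> multiplier."""
    h = np.tanh(W_in @ x)
    if scale:
        for idx, s in scale.items():
            h[idx] *= s              # suppress (s < 1) or amplify (s > 1)
    return W_out @ h

baseline = forward(x)
# Zero out three hypothetical "H-neurons": because W_out mixes every hidden
# unit into every output dimension, the whole output vector shifts.
edited = forward(x, scale={3: 0.0, 17: 0.0, 42: 0.0})
print(np.linalg.norm(baseline - edited))
```

The same mixing that makes the intervention effective against hallucination-linked behavior is what makes it hard to keep the edit surgical in a real model.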
// TAGS
h-neurons · llm · research · safety · open-source
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
RELEVANCE
9/10
AUTHOR
AI Search