OPEN_SOURCE
REDDIT · RESEARCH PAPER
Detection, Routing Paper Exposes Refusal Benchmark Blind Spot
Using political censorship in Chinese-origin LLMs as a natural experiment, the paper argues alignment often happens in a learned detect -> route -> generate layer rather than in concept detection or refusal alone. Across nine open-weight models, held-out generalization and causal ablation surfaced lab-specific behavior that refusal-only benchmarks miss.
// ANALYSIS
This is the right critique of alignment evals: probe accuracy and refusal rates are easy to measure, but neither proves you've found the mechanism that actually changes behavior.
- Probe accuracy is not doing any real discriminating here: political, null-topic, and shuffled-label probes all hit 100%, so the held-out category test is the first result that looks like a real measurement.
- Causal intervention is the strongest evidence in the paper: in 3 of 4 models, ablating the censorship direction restored factual answers, while Qwen3-8B confabulated when the architecture fused knowledge with the censorship signal.
- The routing directions are lab-specific, not universal: political and safety directions are mostly orthogonal, GLM's coupling changes with prompt corpus, and cross-model transfer collapses almost completely.
- Refusal-only evals miss the behavior shift: some Qwen models moved from 25% refusal to 0% while narrative steering hit the ceiling, which means less refusal can still hide tighter censorship.
- The broader lesson is bigger than censorship: safety training and other post-training edits likely change routing more than knowledge, so evaluators need causal and failure-mode evidence, not just probes.
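Why 100% probe accuracy can be meaningless is easy to demonstrate: with more activation dimensions than prompts, a linear probe can perfectly fit even random labels. A minimal sketch with synthetic "activations" (pure noise, hypothetical sizes, not the paper's data or code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 512  # few prompts, high-dimensional activations (illustrative sizes)

# Fake "residual stream activations": pure noise, no topic signal at all.
X = rng.standard_normal((n, d))
true_labels = rng.integers(0, 2, n) * 2 - 1  # +/-1 topic labels
shuffled = rng.permutation(true_labels)      # destroys any label-feature link

def probe_train_accuracy(X, y):
    # Minimum-norm least-squares "probe": w = pinv(X) @ y, predict sign(X @ w).
    # With n < d, X @ w interpolates y exactly, so training accuracy is 100%.
    w = np.linalg.pinv(X) @ y
    return float(np.mean(np.sign(X @ w) == y))

print(probe_train_accuracy(X, true_labels))  # 1.0
print(probe_train_accuracy(X, shuffled))     # 1.0 even on shuffled labels
```

This is why the held-out category test, not raw probe accuracy, is the first number in the paper that measures anything.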
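The causal-ablation result rests on directional ablation: removing a single learned direction from the activations and seeing whether behavior changes. A generic sketch of the projection step on synthetic vectors (the direction and shapes here are made up, not taken from the paper):

```python
import numpy as np

def ablate_direction(h, v):
    """Project activations h (n, d) off the unit-normalized direction v (d,)."""
    v = v / np.linalg.norm(v)
    # h' = h - (h . v) v : zeroes the component of each row along v.
    return h - np.outer(h @ v, v)

rng = np.random.default_rng(1)
v = rng.standard_normal(64)       # hypothetical "censorship direction"
h = rng.standard_normal((8, 64))  # hypothetical activations for 8 tokens
h_ablated = ablate_direction(h, v)

# Every row's component along v is now ~0:
print(np.max(np.abs(h_ablated @ (v / np.linalg.norm(v)))))
```

If factual answers return once this component is zeroed, the direction is causally implicated in routing, which is exactly the evidence refusal benchmarks never produce.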
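The "mostly orthogonal" claim about political vs. safety directions is a cosine-similarity statement, and in high-dimensional residual streams unrelated directions are near-orthogonal by default. A toy illustration with random vectors (synthetic stand-ins, not the paper's extracted directions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
d = 4096                          # illustrative hidden size
political = rng.standard_normal(d)  # stand-in "political" direction
safety = rng.standard_normal(d)     # stand-in "safety" direction

print(cosine(political, political))  # 1.0
print(cosine(political, safety))     # near 0: near-orthogonal in high dim
```

Low cosine similarity within a model, plus near-zero transfer of directions across models, is what supports the lab-specific (rather than universal) reading of the routing behavior.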
// TAGS
detection-is-cheap-routing-is-learned · llm · research · safety · benchmark · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
9/10
AUTHOR
Logical-Employ-9692