Detection, Routing Paper Exposes Refusal Benchmark Blind Spot
OPEN_SOURCE
REDDIT // RESEARCH PAPER · 19d ago


Using political censorship in Chinese-origin LLMs as a natural experiment, the paper argues that alignment often lives in a learned detect -> route -> generate pipeline rather than in concept detection or refusal alone. Across nine open-weight models, held-out generalization tests and causal ablation surfaced lab-specific routing behavior that refusal-only benchmarks miss.

// ANALYSIS

This is the right critique of alignment evals: probe accuracy and refusal rates are easy to measure, but neither proves you've found the mechanism that actually drives behavior.

  • Probe accuracy does no real discriminating here: political, null-topic, and shuffled-label probes all hit 100%, so the held-out-category test is the first result that looks like a real measurement.
  • Causal intervention is the strongest evidence in the paper: in 3 of 4 models, ablating the censorship direction restored factual answers, while Qwen3-8B confabulated when the architecture fused knowledge with the censorship signal.
  • The routing directions are lab-specific, not universal: political and safety directions are mostly orthogonal, GLM's coupling changes with prompt corpus, and cross-model transfer collapses almost completely.
  • Refusal-only evals miss the behavior shift: some Qwen models moved from 25% refusal to 0% while narrative steering hit the ceiling, which means less refusal can still hide tighter censorship.
  • The broader lesson is bigger than censorship: safety training and other post-training edits likely change routing more than knowledge, so evaluators need causal and failure-mode evidence, not just probes.
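The two core operations behind these bullets, ablating a learned "censorship direction" from hidden states and checking whether two routing directions are orthogonal, can be sketched in a few lines. This is a minimal illustration of projection ablation and cosine similarity, not the paper's actual code; the function names and shapes are assumptions.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`
    (projection ablation). `hidden` is (n_tokens, d_model),
    `direction` is (d_model,). Illustrative sketch only."""
    d = direction / np.linalg.norm(direction)
    # subtract each row's projection onto the unit direction
    return hidden - np.outer(hidden @ d, d)

def cosine(u, v):
    """Cosine similarity; near 0 means the two learned directions
    are mostly orthogonal, as reported for political vs. safety."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Running the model on a censored prompt with `ablate_direction` applied at the routing layer is the causal test: if factual answers come back, the direction carried the routing signal rather than the knowledge.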
// TAGS
detection-is-cheap-routing-is-learned · llm · research · safety · benchmark · open-weights

DISCOVERED


2026-03-23

PUBLISHED


2026-03-23

RELEVANCE

9/10

AUTHOR

Logical-Employ-9692