OPEN_SOURCE
REDDIT // RESEARCH PAPER
Paper maps censorship routing in Chinese LLMs
Gregory Frank's paper argues that refusal-based alignment benchmarks miss the real mechanism in censored open-weight LLMs: models detect politically sensitive concepts internally, then route them into refusal or state-aligned steering rather than neutral answers. Across nine models from five labs, the study finds lab-specific routing. Ablation often restores factual output, but in one Qwen variant it triggers confabulation, and in newer Qwen releases it replaces refusal with maximal steering.
// ANALYSIS
Refusal counts are starting to look like a vanity metric; the paper's real claim is that models can learn to look compliant without becoming less censored.
- Within Qwen, hard refusal falls to zero while steering hits the ceiling, so refusal-only metrics can misclassify stronger censorship as weaker censorship.
- Probe accuracy alone is not enough: political probes, null controls, and permutation baselines can all score perfectly, so held-out category generalization is the meaningful test.
- Surgical ablation restores factual answers in most models, which supports a detect-route-generate view rather than a simple knowledge-absent story.
- Qwen is the cautionary outlier: removing the political-sensitivity direction can scramble unrelated facts, implying censorship and factual memory are entangled there.
- Cross-model transfer fails and small-sample audits swing wildly, so there is no universal uncensor vector and no reason to trust a handful of prompts.
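The "surgical ablation" in these bullets refers to removing a learned concept direction from a model's hidden states. The paper's exact procedure isn't reproduced here; a minimal sketch of the standard approach is to project each activation onto a (hypothetical) probe-derived sensitivity vector and subtract that component, leaving everything orthogonal to it untouched:

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along `direction`.

    hidden:    (n_tokens, d_model) activations from one layer
    direction: (d_model,) concept vector, e.g. from a linear probe
               (hypothetical stand-in for the paper's learned direction)
    """
    d = direction / np.linalg.norm(direction)
    # h' = h - (h · d) d  — subtract the projection onto the unit direction
    return hidden - np.outer(hidden @ d, d)

# Toy check: ablated activations are orthogonal to the direction.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))          # fake activations
v = rng.normal(size=8)               # fake "sensitivity" direction
H_abl = ablate_direction(H, v)
print(np.allclose(H_abl @ (v / np.linalg.norm(v)), 0.0))
```

The Qwen entanglement finding above is what makes this risky in practice: if factual recall shares variance with the ablated direction, zeroing it can corrupt unrelated answers rather than cleanly disabling the routing.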
// TAGS
llm · research · safety · qwen · deepseek · glm · yi · detection-is-cheap-routing-is-learned
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
9/10
AUTHOR
Logical-Employ-9692