OPEN_SOURCE
REDDIT // RESEARCH PAPER
Paper maps censorship routing in Chinese LLMs
Gregory Frank's paper argues that refusal-based alignment benchmarks miss the real mechanism in censored open-weight LLMs: models detect politically sensitive concepts internally, then route them into refusal or state-aligned steering rather than neutral answers. Across nine models from five labs, the study finds lab-specific routing. Ablation often restores factual output, but in one Qwen variant it triggers confabulation, and in newer Qwen releases it replaces refusal with maximal steering.
// ANALYSIS
Refusal counts are starting to look like a vanity metric; the paper's real claim is that models can learn to look compliant without becoming less censored.
- Within Qwen, hard refusal falls to zero while steering hits the ceiling, so refusal-only metrics can misclassify stronger censorship as weaker censorship.
- Probe accuracy alone is not enough: political probes, null controls, and permutation baselines can all score perfectly, so held-out category generalization is the meaningful test.
- Surgical ablation restores factual answers in most models, which supports a detect-route-generate view rather than a simple knowledge-absent story.
- Qwen is the cautionary outlier: removing the political-sensitivity direction can scramble unrelated facts, implying censorship and factual memory are entangled there.
- Cross-model transfer fails and small-sample audits swing wildly, so there is no universal uncensor vector and no reason to trust a handful of prompts.
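The "surgical ablation" in these bullets refers to removing a learned concept direction from a model's hidden states. The paper's exact procedure isn't reproduced here; a minimal sketch of the standard approach is to project each activation onto a (hypothetical) probe-derived sensitivity vector and subtract that component, leaving everything orthogonal to it untouched:

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along `direction`.

    hidden:    (n_tokens, d_model) activations from one layer
    direction: (d_model,) concept vector, e.g. from a linear probe
               (hypothetical stand-in for the paper's learned direction)
    """
    d = direction / np.linalg.norm(direction)
    # h' = h - (h · d) d  — subtract the projection onto the unit direction
    return hidden - np.outer(hidden @ d, d)

# Toy check: ablated activations are orthogonal to the direction.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))          # fake activations
v = rng.normal(size=8)               # fake "sensitivity" direction
H_abl = ablate_direction(H, v)
print(np.allclose(H_abl @ (v / np.linalg.norm(v)), 0.0))
```

The Qwen entanglement finding above is what makes this risky in practice: if factual recall shares variance with the ablated direction, zeroing it can corrupt unrelated answers rather than cleanly disabling the routing.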
// TAGS
llm · research · safety · qwen · deepseek · glm · yi · detection-is-cheap-routing-is-learned
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
9/10
AUTHOR
Logical-Employ-9692