Qwen 3.5 weights reveal internal censorship circuits
A mechanistic interpretability study of Alibaba’s Qwen 3.5-9B identifies a "Writer/Reader" circuit that steers sensitive topics toward deflection or state-aligned propaganda. Researchers found three primary internal vectors—d_prc, d_refuse, and d_style—that route model behavior, revealing that the model often "thinks" in Chinese tokens internally before generating English responses.
Qwen 3.5 marks a shift from hard refusals to subtle narrative steering, making political bias an architectural feature rather than a surface-level filter.
- Censorship is deeply entangled with factual knowledge; removing the filter via "abliteration" produces a 72% hallucination rate, with historical events swapped for one another (e.g., Pearl Harbor for Tiananmen).
- The model's internal "verdict" is committed in Chinese-token representations even for English prompts, suggesting alignment was performed primarily on Chinese-language reasoning chains.
- The three-dimensional signal in the residual stream allows granular control over response style, distinguishing "neutral" deflection from "active" regime-defense propaganda.
- The mechanism misfires on structurally similar non-PRC topics, applying CCP-style narrative defense to the Saudi government and Kosovo.
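To make the residual-stream claims above concrete, here is a minimal sketch of two operations such a study typically relies on: reading a hidden state's coordinates along a few directions, and removing one direction by projection. It assumes "abliteration" means directional ablation of this kind; the summary does not describe the study's actual procedure. The directions here are random placeholders that reuse the names d_prc, d_refuse, and d_style, and the hidden size, the helper functions `read_coordinates` and `ablate`, and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, far smaller than a real model

# Placeholder directions: in a real study these would be estimated from
# model activations. Here they are random and then orthonormalised.
names = ["d_prc", "d_refuse", "d_style"]
q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))
basis = q.T  # shape (3, d_model), one unit-length direction per row


def read_coordinates(resid, basis):
    """Return the component of a residual vector along each direction."""
    return basis @ resid


def ablate(resid, direction):
    """Remove the component of resid that lies along direction."""
    unit = direction / np.linalg.norm(direction)
    return resid - (resid @ unit) * unit


# A synthetic residual vector with a strong component along d_refuse.
resid = rng.normal(size=d_model) + 3.0 * basis[1]

coords = read_coordinates(resid, basis)
ablated = ablate(resid, basis[1])
coords_after = read_coordinates(ablated, basis)

for name, before, after in zip(names, coords, coords_after):
    print(f"{name}: before={before:+.3f} after={after:+.3f}")
```

In this toy setup the directions are exactly orthogonal, so ablating d_refuse leaves the other two coordinates unchanged. The entanglement with factual knowledge reported above suggests that assumption does not hold in the real model, where removing one direction also disturbs other information.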
DISCOVERED: 2026-05-19
PUBLISHED: 2026-05-19
AUTHOR: s314