Qwen 3.5 weights reveal internal censorship circuits

// 45d ago · RESEARCH PAPER

A mechanistic interpretability study of Alibaba’s Qwen 3.5-9B identifies a "Writer/Reader" circuit that steers responses on sensitive topics toward deflection or state-aligned propaganda. The researchers report three primary internal vectors (d_prc, d_refuse, and d_style) that route model behavior, and find that the model often "thinks" in Chinese tokens internally before generating English responses.
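To make the idea of "internal vectors" concrete, here is a minimal sketch of how a hidden state can be read along named directions. It is not the study's code: the 4-dimensional toy vectors, the `hidden` activation, and the helper names `dot`, `unit`, and `read_signal` are all assumptions made for illustration. Only the direction names come from the summary above.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def read_signal(hidden, directions):
    """Scalar projection of `hidden` onto each named direction."""
    return {name: dot(hidden, unit(d)) for name, d in directions.items()}

# Hypothetical toy values; a real residual stream has thousands of dimensions.
directions = {
    "d_prc": [1.0, 0.0, 0.0, 0.0],
    "d_refuse": [0.0, 2.0, 0.0, 0.0],
    "d_style": [0.0, 0.0, 1.0, 1.0],
}
hidden = [0.5, -1.0, 3.0, 1.0]

# One coordinate per direction: together they form a 3D readout.
signal = read_signal(hidden, directions)
```

Each value in `signal` measures how strongly the activation points along one direction, which is the sense in which three vectors can act as a low-dimensional readout of the model's state.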

// ANALYSIS

Qwen 3.5 marks a shift from hard refusals to subtle narrative steering, making political bias an architectural feature rather than a surface-level filter.

– Censorship is deeply entangled with factual knowledge: removing the filter via "abliteration" yields a 72% hallucination rate, with historical events swapped for one another (e.g., Pearl Harbor for Tiananmen).
– The model's internal "verdict" is committed in Chinese-token layers even for English prompts, suggesting that alignment was performed primarily on Chinese-language reasoning chains.
– The three-dimensional signal in the residual stream (d_prc, d_refuse, d_style) allows granular control over response style, distinguishing "neutral" deflection from "active" regime-defense propaganda.
– The mechanism misfires on structurally similar non-PRC topics, applying CCP-style narrative defense to the Saudi government and Kosovo.
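"Abliteration" is commonly described as removing a direction from a model's activations or weights by orthogonal projection. The sketch below shows that projection on a single toy activation. It assumes this general technique rather than the study's exact procedure, and the vectors and the `ablate` helper are hypothetical.

```python
def ablate(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Hypothetical toy values, not taken from the model.
d_refuse = [0.0, 2.0, 0.0, 0.0]
hidden = [0.5, -1.0, 3.0, 1.0]

ablated = ablate(hidden, d_refuse)

# After ablation the activation has no component left along d_refuse.
residual = sum(a * d for a, d in zip(ablated, d_refuse))
```

In this toy case the direction lines up with a single coordinate, so only that coordinate changes. If a direction in a real model also carries factual information, projecting it out removes that information too, which is one way to read the entanglement described in the first bullet.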

// TAGS

qwen3-5 · llm · open-weights · interpretability · safety · research

DISCOVERED

45d ago

2026-05-19

PUBLISHED

45d ago

2026-05-19

RELEVANCE

9/10

AUTHOR

s314
