Qwen 3.5 weights reveal internal censorship circuits

// 3h ago · RESEARCH PAPER

A mechanistic interpretability study of Alibaba’s Qwen 3.5-9B identifies a "Writer/Reader" circuit that steers sensitive topics toward deflection or state-aligned propaganda. The researchers report three primary internal vectors (d_prc, d_refuse, and d_style) that route model behavior, and find that the model often "thinks" in Chinese tokens internally before generating English responses.

// ANALYSIS

Qwen 3.5 marks a shift from hard refusals to subtle narrative steering, making political bias an architectural feature rather than a surface-level filter.

  • Censorship is deeply entangled with factual knowledge; removing the filter via "abliteration" yields a 72% hallucination rate, with historical events swapped for one another (e.g., Pearl Harbor for Tiananmen).
  • The model's internal "verdict" commits in Chinese layers even for English prompts, suggesting alignment was primarily performed on Chinese-language reasoning chains.
  • The three-dimensional signal in the residual stream allows granular control over response style, distinguishing "neutral" deflection from "active" regime-defense propaganda.
  • This mechanism misfires on structurally similar non-PRC topics, applying CCP-style narrative defense to the Saudi government and Kosovo.
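
The two operations the analysis leans on can be sketched in a few lines: reading a residual-stream vector's coordinates along named directions, and "abliterating" one direction by projecting it out. This is a minimal illustration under stated assumptions, not the study's code. The direction names d_prc, d_refuse, and d_style come from the summary above; the 4-dimensional toy vectors, their values, and the helper names `project` and `ablate` are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def project(h, directions):
    """Return the scalar coordinate of h along each named direction."""
    return {name: dot(h, normalize(d)) for name, d in directions.items()}


def ablate(h, d):
    """Remove the component of h along d: h - (h . d_hat) * d_hat."""
    d_hat = normalize(d)
    coeff = dot(h, d_hat)
    return [x - coeff * y for x, y in zip(h, d_hat)]


# Toy values only: real directions would be extracted from model activations.
directions = {
    "d_prc": [1.0, 0.0, 0.0, 0.0],
    "d_refuse": [0.0, 2.0, 0.0, 0.0],
    "d_style": [0.0, 0.0, 1.0, 1.0],
}
h = [0.5, -1.0, 2.0, 0.0]  # stand-in for one residual-stream activation

coords = project(h, directions)                 # 3 coordinates, one per direction
h_ablated = ablate(h, directions["d_refuse"])   # d_refuse component removed
coords_after = project(h_ablated, directions)
```

In this toy setup the three directions are mutually orthogonal, so ablating d_refuse leaves the other two coordinates unchanged. When directions overlap, projecting one out also shifts the coordinates along the others.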
// TAGS
qwen3-5 · llm · open-weights · interpretability · safety · research

DISCOVERED: 3h ago (2026-05-19)

PUBLISHED: 7h ago (2026-05-19)

RELEVANCE: 9/10

AUTHOR: s314