Qwen3.5-397B Abliteration Exposes MoE Refusals
REDDIT // 6d ago // RESEARCH PAPER

A Mac Studio experiment adapts FailSpy’s abliteration workflow to Qwen3.5-397B-A17B and claims that censorship of PRC-political topics can be removed while drug and weapons refusals stay intact. The post argues that MoE models split refusal behavior across different activation routes, so inference-time hooks behave materially differently from weight-baked edits.

// ANALYSIS

The interesting part is not the censorship angle, but the architectural claim: sparse MoE models may encode safety behavior in routing decisions that simple projection edits cannot fully erase.

  • If the routing hypothesis holds, refusal behavior in MoE models is not just a single direction in the residual stream, which makes ablation intuitions carried over from dense models unreliable
  • Weight-baking vs runtime hooks diverging is operationally important: a “fixed” checkpoint may still behave differently from an instrumented inference path
  • The top-k fragility on the 397B model suggests this technique is highly sensitive to scale and router geometry, not a plug-and-play recipe
  • The writeup is most useful as a reproducible probe for where model behavior actually lives, not just as a censorship-removal demo
  • The local-quantized workflow is notable because it lowers the barrier to this kind of mechanistic testing on consumer hardware
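
The hook-versus-baked point above can be made concrete with a toy sketch. It is not the post's code: every name, matrix, and number below is hypothetical, and the model is a 2-dimensional stand-in with three experts and top-1 routing. It assumes the runtime hook removes the refusal direction from the hidden state before the router reads it, while the baked edit only orthogonalizes the expert output weights. Under those assumptions the two agree for a dense layer but can pick different experts in the MoE case.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def ablate(x, r_hat):
    # Directional ablation: x - (x . r_hat) r_hat, with r_hat a unit vector.
    c = dot(x, r_hat)
    return [xi - c * ri for xi, ri in zip(x, r_hat)]

def matvec(W, x):
    return [dot(row, x) for row in W]

def bake(W, r_hat):
    # Weight-baked edit: W' = (I - r r^T) W, done by ablating each column of W.
    cols = [ablate(list(col), r_hat) for col in zip(*W)]
    return [list(row) for row in zip(*cols)]

def moe(h, router, experts, k=1):
    # Toy sparse MoE: pick top-k experts by router logit, softmax over the chosen.
    logits = matvec(router, h)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in chosen)
    w = [math.exp(logits[i] - m) for i in chosen]
    z = sum(w)
    out = [0.0] * len(experts[0])
    for wi, i in zip(w, chosen):
        y = matvec(experts[i], h)
        out = [o + (wi / z) * yi for o, yi in zip(out, y)]
    return chosen, out

# Hypothetical toy values, chosen only to make the divergence visible.
r_hat = unit([1.0, 0.0])                      # assumed "refusal direction"
h = [2.0, 1.0]                                # hidden state entering the block
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
experts = [
    [[1.0, 1.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]

# Dense layer: hooking the output and baking the weights give the same result.
dense_hook = ablate(matvec(experts[0], h), r_hat)
dense_baked = matvec(bake(experts[0], r_hat), h)

# MoE, baked: the router still sees the unmodified h.
baked_route, baked_out = moe(h, router, [bake(E, r_hat) for E in experts])

# MoE, hooked: the router sees the ablated h, so routing itself can change.
hook_route, raw = moe(ablate(h, r_hat), router, experts)
hook_out = ablate(raw, r_hat)

print("dense equal:", dense_hook == dense_baked)
print("baked route:", baked_route, "hook route:", hook_route)
print("baked out:", baked_out, "hook out:", hook_out)
```

Both MoE outputs have no component left along `r_hat`, yet they differ because a different expert was selected. That is the sense in which a baked checkpoint and an instrumented inference path need not match, if the routing hypothesis holds.
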
// TAGS
qwen3.5-397b, alfred-abliterate, llm, open-weights, safety, research, inference

DISCOVERED

6d ago

2026-04-06

PUBLISHED

6d ago

2026-04-06

RELEVANCE

9/10

AUTHOR

trevorbg