OPEN_SOURCE
REDDIT // 6d ago // RESEARCH PAPER
Qwen3.5-397B Abliteration Exposes MoE Refusals
A Mac Studio experiment adapts FailSpy’s abliteration workflow to Qwen3.5-397B-A17B, claiming PRC-political censorship can be removed without breaking drug or weapons refusals. The post argues MoE models split refusal behavior across different activation routes, making inference-time hooks materially different from weight-baked edits.
// ANALYSIS
The interesting part is not the censorship angle, but the architectural claim: sparse MoE models may encode safety behavior in routing decisions that simple projection edits cannot fully erase.
- If the routing hypothesis holds, refusal behavior in MoE models is not just a direction in residual space, which makes dense-model ablation intuitions unreliable
- Weight-baking vs runtime hooks diverging is operationally important: a “fixed” checkpoint may still behave differently from an instrumented inference path
- The top-k fragility on the 397B model suggests this technique is highly sensitive to scale and router geometry, not a plug-and-play recipe
- The writeup is most useful as a reproducible probe for where model behavior actually lives, not just as a censorship-removal demo
- The local-quantized workflow is notable because it lowers the barrier to this kind of mechanistic testing on consumer hardware
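To make the hook-versus-baked distinction concrete, here is a minimal sketch in plain Python. It is not the post's code: the helper names (`ablate`, `bake`, `top_k`), the 3-dimensional vectors, and the 4-expert router matrix `G` are all made-up toy assumptions. The first half checks the textbook identity that, for a single dense matrix writing into the residual stream, projecting the direction out of the weights gives the same output as projecting it out of the activation at runtime. The second half shows that a top-k router reading an ablated input can select different experts than one reading the raw input, which is one way the two paths could diverge in a sparse model if the routing hypothesis holds.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def matvec(W, h):
    return [dot(row, h) for row in W]

def ablate(v, r):
    """Runtime-hook style: remove the component of v along unit vector r."""
    c = dot(v, r)
    return [x - c * ri for x, ri in zip(v, r)]

def bake(W, r):
    """Weight-baked style: W' = (I - r r^T) W, so W' never writes along r."""
    rows, cols = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r.
    rTW = [sum(r[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r[i] * rTW[j] for j in range(cols)] for i in range(rows)]

def top_k(logits, k):
    """Indices of the k largest logits, returned in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: -logits[i])
    return sorted(ranked[:k])

# Hypothetical "refusal direction" in a 3-dim toy residual space.
r = unit([1.0, 1.0, 0.0])

# Part 1: dense write. Hooking the output and baking the weights agree.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy output projection (3x2)
h = [1.0, -1.0]                            # toy hidden activation
hooked = ablate(matvec(W, h), r)
baked_out = matvec(bake(W, r), h)

# Part 2: toy router. Ablating the router's input can change top-k selection.
G = [[1.0, 0.0, 0.0],    # expert 0
     [0.0, 1.0, 0.0],    # expert 1
     [0.0, 0.0, 1.0],    # expert 2
     [0.0, -1.0, 0.5]]   # expert 3
x = [2.0, 0.5, 1.0]                        # toy residual-stream input
raw_experts = top_k(matvec(G, x), 2)               # [0, 2]
ablated_experts = top_k(matvec(G, ablate(x, r)), 2)  # [2, 3]

print(hooked, baked_out)
print(raw_experts, ablated_experts)
```

With these toy numbers the dense outputs match, while the router picks experts 0 and 2 on the raw input and experts 2 and 3 on the ablated one. Whether anything like this happens in Qwen3.5-397B-A17B depends on which matrices are edited and where hooks are placed, which this sketch does not model.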
// TAGS
qwen3.5-397b, alfred-abliterate, llm, open-weights, safety, research, inference
DISCOVERED
6d ago
2026-04-06
PUBLISHED
6d ago
2026-04-06
RELEVANCE
9/10
AUTHOR
trevorbg