Sarvam models abliterated, dual refusal circuits found
OPEN_SOURCE
REDDIT // 3d ago // MODEL RELEASE


Researcher Alosh Denny has "abliterated" Sarvam AI's 30B and 105B reasoning models, discovering that these MoE architectures employ dual refusal circuits that operate independently: one governing the internal reasoning trace, the other the final output.

// ANALYSIS

This project provides a rare peek into the mechanics of MoE reasoning models, showing that alignment is a multi-layered process that can be mathematically bypassed.

  • Dual circuits in reasoning models suggest that internal CoT can be compliant even while the final output is forced into refusal.
  • The "pre-linguistic" nature of refusal is a significant finding; a single English-derived direction can uncensor a model across dozens of Indian languages.
  • Surgical weight projection remains the most effective way to uncensor models without the catastrophic forgetting associated with fine-tuning.
  • These models represent the first major "abliteration" of a reasoning-specialized MoE architecture, demonstrating that the methodology scales to complex MoE setups.
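The "abliteration" technique referenced above is generally understood as directional ablation: estimate a refusal direction from the difference in mean activations between refused and answered prompts, then project that direction out of the model's weight matrices. A minimal sketch with NumPy, using random stand-in data (the activation arrays, dimensions, and matrix here are illustrative, not taken from the Sarvam models):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-in residual-stream activations; in practice these are captured
# with forward hooks while running harmful vs. harmless prompt sets.
harmful_acts = rng.normal(size=(128, d_model)) + 2.0   # prompts the model refuses
harmless_acts = rng.normal(size=(128, d_model))        # prompts it answers

# 1. Refusal direction = normalized difference of mean activations.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

# 2. "Surgical weight projection": under the convention y = W @ x, remove
#    the component of W that writes along r, i.e. W <- (I - r r^T) W.
W = rng.normal(size=(d_model, d_model))  # e.g. one MLP down-projection
W_abliterated = W - np.outer(r, r) @ W

# The edited matrix can no longer write anything along r:
# r^T (W_abliterated @ x) = 0 for every input x.
print(np.abs(r @ W_abliterated).max())  # ~0 up to float error
```

Applied to every matrix that writes into the residual stream, this single rank-1 edit is what lets one English-derived direction suppress refusal across languages, without the retraining that fine-tuning would require.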
// TAGS
sarvam-30b-105b-uncensored · llm · reasoning · open-weights · fine-tuning · safety · research

DISCOVERED

2026-04-09

PUBLISHED

2026-04-08

RELEVANCE

8 / 10

AUTHOR

Available-Deer1723