OPEN_SOURCE
REDDIT // MODEL RELEASE
Sarvam models abliterated, dual refusal circuits found
Researcher Alosh Denny has "abliterated" Sarvam AI's 30B and 105B reasoning models, discovering that these MoE architectures employ dual refusal circuits that operate independently: one acting on the internal reasoning trace and another on the final output.
// ANALYSIS
This project offers a rare look into the mechanics of MoE reasoning models, showing that alignment is a multi-layered process that can be mathematically bypassed.
- Dual circuits in reasoning models suggest the internal CoT can be compliant even while the final output is forced into refusal.
- The "pre-linguistic" nature of refusal is a significant finding; a single English-derived direction can uncensor a model across dozens of Indian languages.
- Surgical weight projection remains the most effective way to uncensor models without the catastrophic forgetting associated with fine-tuning (see the sketch after this list).
- These models represent the first major "abliteration" of a reasoning-specialized MoE architecture, demonstrating that the methodology scales to complex MoE setups.
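The post does not include code, but the standard abliteration recipe these findings rest on is short: compute a refusal direction as the difference of mean residual-stream activations between harmful and harmless prompts, then orthogonalize every weight matrix that writes to the residual stream against that direction. The PyTorch sketch below is a minimal illustration of that recipe; the function names, tensor shapes, and choice of layer and token position are assumptions for illustration, not details from the release.

import torch

def refusal_direction(harmful_acts, harmless_acts):
    # Difference of means over residual-stream activations captured at a
    # chosen layer and token position; both inputs are (n_prompts, d_model).
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()

def abliterate_weight(W, r):
    # "Surgical weight projection": W' = W - r r^T W removes the component
    # of W's output that lies along the unit-norm refusal direction r.
    # Applied to every matrix writing to the residual stream, e.g. attention
    # output projections and the down-projection of each MoE expert.
    # Shapes: W is (d_model, d_in), r is (d_model,).
    return W - torch.outer(r, r @ W)

Because the edit is a direct weight projection rather than a gradient update, nothing else in the model changes, which is why the approach avoids the catastrophic forgetting associated with fine-tuning; and because the direction lives in activation space rather than token space, a single English-derived direction can plausibly transfer across languages.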
// TAGS
sarvam-30b-105b-uncensored · llm · reasoning · open-weights · fine-tuning · safety · research
DISCOVERED
2026-04-09
PUBLISHED
2026-04-08
RELEVANCE
8/10
AUTHOR
Available-Deer1723