OPEN_SOURCE
REDDIT // OPEN-SOURCE RELEASE
Sarvam reasoning models hit abliteration
Researcher Alosh Denny removes refusal mechanisms from India’s first multilingual MoE reasoning models, Sarvam-30B and 105B. The "abliteration" process reveals a unique dissociation between internal reasoning and final answer projection, showing that reasoning models can "think" toward compliance while still refusing in the output.
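The post does not include the ablation code, but the standard abliteration recipe it builds on is simple: take the difference of mean residual-stream activations between refused and benign prompts, normalize it into a "refusal direction," and project that direction out of the hidden states. A minimal PyTorch sketch under those assumptions (function names hypothetical; the activation tensors are assumed to be pre-collected at a single layer):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-of-means over [n_prompts, d_model] residual-stream
    # activations captured at one layer/token position per prompt set.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Orthogonal projection: strip the component of each hidden state
    # along the refusal direction, leaving all other features intact.
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```

Applied as an inference-time hook this suppresses refusals without retraining; baking the same projection into the weights is the variant sketched after the analysis bullets below.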
// ANALYSIS
Reasoning models possess a dual-circuit refusal architecture that complicates safety alignment beyond standard LLM techniques.
- Identified two distinct refusal circuits: one in the core reasoning layers and one at the final lm_head projection (see the weight-ablation sketch after this list).
- Refusal mechanisms are pre-linguistic: directions computed on English prompts successfully uncensored Malayalam, Hindi, and Kannada outputs.
- The uncensored 105B variant remains highly competitive, scoring 98.6 on MATH-500 and 71.7 on LiveCodeBench.
- Surgical Refusal Ablation (SRA) shows that safety layers in reasoning-heavy MoEs are structural rather than language-dependent.
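SRA's exact procedure is not published in the thread, but the dual-circuit finding maps naturally onto a weight-orthogonalization pass applied at both sites. A hedged sketch, assuming a HuggingFace Llama-style module layout (`model.model.layers`, `lm_head`; a real MoE would additionally loop over each expert's down projection):

```python
import torch

def orthogonalize_output(weight: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    # Remove d from the matrix's *output* space, so this layer can no
    # longer write along the refusal direction into the residual stream.
    return weight - torch.outer(d, d @ weight)

def ablate_dual_circuit(model, direction: torch.Tensor) -> None:
    d = direction / direction.norm()
    # Circuit 1: core reasoning layers -- every projection that writes
    # back into the residual stream.
    for layer in model.model.layers:
        layer.self_attn.o_proj.weight.data = orthogonalize_output(
            layer.self_attn.o_proj.weight.data, d)
        layer.mlp.down_proj.weight.data = orthogonalize_output(
            layer.mlp.down_proj.weight.data, d)
    # Circuit 2: the final lm_head projection. It *reads* the residual
    # stream, so project d out of its input space instead.
    W = model.lm_head.weight.data  # [vocab_size, d_model]
    model.lm_head.weight.data = W - torch.outer(W @ d, d)
```

Ablating only the first circuit would plausibly reproduce the dissociation the post describes: the chain-of-thought drifts toward compliance while the untouched lm_head still steers the final tokens back to a refusal.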
// TAGS
llm · reasoning · open-weights · multilingual · sarvam-ai · safety · abliteration
DISCOVERED
2026-04-09
PUBLISHED
2026-04-08
RELEVANCE
8/10
AUTHOR
Available-Deer1723