Sarvam reasoning models hit abliteration
REDDIT // 3d ago // OPEN SOURCE RELEASE

Researcher Alosh Denny removes the refusal mechanisms from India’s first multilingual MoE reasoning models, Sarvam-30B and Sarvam-105B. The "abliteration" process reveals a distinctive dissociation between internal reasoning and the final answer projection: the models can "think" their way toward compliance while still refusing in the output.

// ANALYSIS

Reasoning models possess a dual-circuit refusal architecture that complicates safety alignment beyond standard LLM techniques.

  • Identified two distinct refusal circuits: one in core reasoning layers and one at the final lm_head projection.
  • Refusal mechanisms are pre-linguistic; English-computed directions successfully uncensored Malayalam, Hindi, and Kannada outputs.
  • Uncensored 105B variant remains highly competitive, maintaining 98.6 on MATH-500 and 71.7 on LiveCodeBench.
  • Surgical Refusal Ablation (SRA) proves that safety layers in reasoning-heavy MoEs are structural rather than language-dependent.
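The two-circuit finding above builds on the standard abliteration recipe: compute a "refusal direction" as the difference between mean activations on harmful versus harmless prompts, then project that direction out of the weights that write into the residual stream. A minimal sketch of that recipe (illustrative names and toy data; not Sarvam's or Denny's actual code, which per the article must also handle the separate lm_head circuit):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from mean harmless activation toward mean harmful activation."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(W, d):
    """Project the refusal direction out of a weight matrix's output space.

    W has shape (d_model, d_in): each column is a vector the layer can
    write into the residual stream. Left-multiplying by (I - d d^T)
    removes the component along d from every such vector.
    """
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d) @ W

# Toy demo with random data standing in for captured activations.
rng = np.random.default_rng(0)
d_model, d_in = 8, 4
W = rng.normal(size=(d_model, d_in))
harmful = rng.normal(size=(16, d_model)) + 2.0   # stand-in "harmful" activations
harmless = rng.normal(size=(16, d_model))        # stand-in "harmless" activations

d = refusal_direction(harmful, harmless)
W_ablated = ablate_direction(W, d)

# After ablation, the weight has no output component along d.
print(np.abs(d @ W_ablated).max())
```

In a real model this projection would be applied to every matrix that writes into the residual stream (attention output and MLP down-projections); the article's point is that in these reasoning MoEs a second, independent refusal circuit at the lm_head survives such per-layer ablation.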
// TAGS
llm · reasoning · open-weights · multilingual · sarvam-ai · safety · abliteration

DISCOVERED

2026-04-09

PUBLISHED

2026-04-08

RELEVANCE

8/10

AUTHOR

Available-Deer1723