Nemotron-3-Super-120B runs uncensored on Apple Silicon
OPEN_SOURCE ↗
REDDIT // 29d ago // OPEN-SOURCE RELEASE


A community release strips safety guardrails from NVIDIA's hybrid Nemotron-Super-120B model using CRACK weight surgery, producing a 4-bit MLX-quantized variant that runs at 43–58 tok/s on Apple Silicon. HumanEval scores of 94% confirm coding capability is largely preserved post-modification.

// ANALYSIS

Community uncensored releases of frontier-class models keep pace with official launches, and Nemotron's novel hybrid architecture made this a genuinely hard technical problem to solve.

  • Nemotron-Super-120B's unique three-pathway design — 40 Mamba-2 SSM layers, 40 LatentMoE layers (512 experts, top-22), and 8 attention layers — breaks standard fp16-then-quantize workflows; all surgery must happen at quantization level
  • CRACK weight surgery targets the architectural convergence point of all three pathway types, suppressing refusal behavior at the weight level rather than via prompt injection or fine-tuning
  • 4-bit MLX quant achieves 43–58 tok/s on M3 Ultra 256GB, putting 120B-class local inference within reach for well-equipped Mac users
  • LM Studio is incompatible: it silently drops 697 essential tensors. Only MLX Studio or vMLX load the model correctly, a notable gotcha for the community
  • A chat template workaround introduced occasional missing closing think tags, an acknowledged tradeoff of the approach
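The 4-bit requirement above follows from simple memory arithmetic: a 120B-parameter model in fp16 barely fits even a 256 GB machine once activations and KV cache are added. A rough sketch (the function name and the ~0.5 bit/weight allowance for quantization scale metadata are illustrative assumptions, not figures from the release):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a model with the given
    parameter count (billions) and effective bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# fp16: ~223 GiB of weights alone -- marginal on a 256 GB M3 Ultra
fp16 = weights_gb(120, 16)
# 4-bit with ~0.5 bit/weight of group-scale overhead: ~63 GiB
q4 = weights_gb(120, 4.5)
print(f"fp16 ≈ {fp16:.0f} GiB, 4-bit ≈ {q4:.0f} GiB")
```

This is why the release ships as a 4-bit MLX quant rather than fp16: only the quantized form leaves comfortable headroom for the KV cache and the OS on a 256 GB Mac.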
// TAGS
llm · open-weights · open-source · self-hosted · inference · benchmark

DISCOVERED

29d ago (2026-03-14)

PUBLISHED

29d ago (2026-03-14)

RELEVANCE

6/10

AUTHOR

HealthyCommunicat