OPEN_SOURCE
REDDIT · 29d ago · OPEN-SOURCE RELEASE
Nemotron-3-Super-120B runs uncensored on Apple Silicon
A community release strips safety guardrails from NVIDIA's hybrid Nemotron-Super-120B model using CRACK weight surgery, producing a 4-bit MLX-quantized variant that runs at 43–58 tok/s on Apple Silicon. A HumanEval score of 94% confirms that coding capability is largely preserved post-modification.
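For background on what a 4-bit quant of this kind involves, here is a minimal sketch of symmetric group quantization in the style commonly used for on-device weights. The group size, symmetric scheme, and function names are illustrative assumptions, not the actual MLX kernel used by this release.

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Symmetric 4-bit group quantization (illustrative, not the MLX kernel).

    Each group of `group_size` consecutive weights shares one scale; values
    are mapped to signed integers in [-8, 7] (here only [-7, 7] is used,
    since the scale is derived from the group's max absolute value).
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Reconstruct approximate fp32 weights from quantized groups."""
    return (q.astype(np.float32) * scale).reshape(shape)

# Round-trip a random weight matrix and measure worst-case error.
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, s = quantize_4bit(w.ravel())
w_hat = dequantize_4bit(q, s, w.shape)
err = np.abs(w - w_hat).max()
```

Per-element error is bounded by half a quantization step (scale / 2), which is why 4-bit weights can preserve most task capability while cutting memory roughly 4x versus fp16.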
// ANALYSIS
Community uncensored releases of frontier-class models keep pace with official launches, and Nemotron's novel hybrid architecture made this a genuinely hard technical problem to solve.
- Nemotron-Super-120B's unique three-pathway design (40 Mamba-2 SSM layers, 40 LatentMoE layers with 512 experts and top-22 routing, and 8 attention layers) breaks standard fp16-then-quantize workflows; all surgery must happen at the quantization level
- CRACK weight surgery targets the architectural convergence point of all three pathway types, suppressing refusal behavior at the weight level rather than via prompt injection or fine-tuning
- 4-bit MLX quant achieves 43–58 tok/s on an M3 Ultra with 256GB, putting 120B-class local inference within reach for well-equipped Mac users
- LM Studio silently drops 697 essential tensors and is incompatible; only MLX Studio or vMLX work correctly, a notable gotcha for the community
- A chat template workaround introduced occasional missing closing think tags, an acknowledged tradeoff of the approach
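The summary does not describe CRACK's internals, but one widely discussed weight-level refusal-suppression technique is directional ablation: projecting a hypothesized "refusal direction" out of a layer's output weights so refusal features cannot be written into the residual stream. The sketch below assumes a single direction and generic names; it is not the release's actual procedure.

```python
import numpy as np

def ablate_direction(W, d):
    """Remove the component of W's output space along direction d.

    W: (d_out, d_in) weight matrix; d: (d_out,) vector along which refusal
    activations are hypothesized to live. Returns (I - d d^T) @ W, so no
    output of the modified layer has a component along d.
    """
    d = d / np.linalg.norm(d)          # work with a unit vector
    return W - np.outer(d, d @ W)      # subtract the projection onto d

# Demonstrate that the ablated weights produce no output along d.
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8))
d = rng.standard_normal(16)
W2 = ablate_direction(W, d)
resid = (d / np.linalg.norm(d)) @ W2   # component of outputs along d
```

Because the edit lives in the weights themselves, it survives quantization and needs no runtime prompt tricks, which matches the bullet's contrast with prompt injection and fine-tuning.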
// TAGS
llm · open-weights · open-source · self-hosted · inference · benchmark
DISCOVERED
2026-03-14
PUBLISHED
2026-03-14
RELEVANCE
6/10
AUTHOR
HealthyCommunicat