OPEN_SOURCE
REDDIT // 5h ago · MODEL RELEASE
Nemotron REAP cut hits AIME 90%+
Max-and-Omnis released a REAP-pruned math variant of NVIDIA Nemotron-3-Super, shrinking the 120B latent-MoE model to 64B while keeping 12B active parameters. The AWQ and FP8 builds reportedly top 90% avg@4 on AIME 2026 and fit on a single high-end H100 or RTX PRO 6000 Blackwell.
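The avg@4 figure is the usual repeated-sampling score: sample four completions per problem, grade each, and average per-problem accuracy across the set. A minimal sketch of that metric, assuming you already have per-sample correctness flags (the function name and demo data below are illustrative, not from the release):

```python
# Minimal avg@k sketch (here k=4): average per-problem accuracy over k
# sampled completions, then average across problems.
def avg_at_k(results: list[list[bool]], k: int = 4) -> float:
    """results[i] holds correctness flags for k samples of problem i."""
    per_problem = [sum(flags[:k]) / k for flags in results]
    return sum(per_problem) / len(per_problem)

# Example: 3 problems, 4 samples each (made-up flags, not real eval data)
demo = [
    [True, True, True, False],   # 0.75
    [True, True, True, True],    # 1.00
    [True, False, True, True],   # 0.75
]
print(f"avg@4 = {avg_at_k(demo):.3f}")  # -> 0.833
```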
// ANALYSIS
This is a serious local-inference experiment, but the headline number should be read as a community benchmark until broader evals and reproduction land.
- REAP pruning from 512 to 256 experts is the real story: it cuts deployment weight without giving up the sparse-MoE active-parameter profile
- FP8 beats AWQ on quality but takes a roughly 40% throughput hit, making this a practical quality-vs-latency choice for math workloads (see the serving sketch after this list)
- The included vLLM patch matters because expert routing edge cases still break real-world serving paths for unusual MoE shapes
- Fine-tuning on about 270 AIMO3 and AstralMath problems means the AIME result is impressive but narrow, and potentially sensitive to prompt placement
- Single-GPU 90%+ AIME-class math performance is exactly the kind of open-weights pressure that makes smaller, specialized reasoning models worth watching
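The quality-vs-throughput call above mostly comes down to which checkpoint you point vLLM at. A minimal single-GPU serving sketch, assuming the release's vLLM routing patch is already applied; the repo id is hypothetical, and swapping `quantization` between "fp8" and "awq" selects the higher-quality or higher-throughput build:

```python
# Minimal single-GPU vLLM sketch for a pruned-MoE math model.
# The model id below is hypothetical; substitute the actual release
# checkpoint, and apply the bundled vLLM expert-routing patch first.
from vllm import LLM, SamplingParams

llm = LLM(
    model="max-and-omnis/nemotron-3-super-64b-math-reap-fp8",  # hypothetical id
    quantization="fp8",            # or "awq" for the faster, lower-quality build
    tensor_parallel_size=1,        # single H100 / RTX PRO 6000 Blackwell
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)
outputs = llm.generate(
    ["Find the number of ordered pairs (a, b) of positive integers ..."],
    params,
)
print(outputs[0].outputs[0].text)
```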
// TAGS
nemotron-3-super-64b-math-reap · llm · reasoning · fine-tuning · inference · gpu · open-weights · benchmark
DISCOVERED
5h ago
2026-04-22
PUBLISHED
6h ago
2026-04-22
RELEVANCE
9/10
AUTHOR
max6296