Nemotron REAP cut hits AIME 90%+
Max-and-Omnis released a REAP-pruned math variant of NVIDIA Nemotron-3-Super, shrinking the 120B latent-MoE model to 64B while keeping 12B active parameters. The AWQ and FP8 builds reportedly top 90% avg@4 on AIME 2026 and fit on a single high-end GPU such as an H100 or RTX PRO 6000 Blackwell.
This is a serious local-inference experiment, but the headline number should be read as a community benchmark until broader evals and reproduction land.
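For readers checking the headline number, here is a minimal sketch of how an avg@4 score is usually computed. It assumes avg@k means the mean per-problem accuracy over k independent samples; the release may define it differently. The function name `avg_at_k` and the sample data are hypothetical and do not come from the release.

```python
def avg_at_k(results):
    """Mean accuracy over k samples per problem, averaged across problems.

    results: list of per-problem lists of booleans (True = correct answer).
    Assumption: every problem was sampled the same number of times (k).
    """
    if not results:
        raise ValueError("no problems given")
    k = len(results[0])
    if k == 0 or any(len(r) != k for r in results):
        raise ValueError("each problem needs the same non-zero number of samples")
    # Per-problem accuracy first, then the mean across problems.
    per_problem = [sum(r) / k for r in results]
    return sum(per_problem) / len(per_problem)


# Hypothetical toy data: 3 problems, 4 samples each (not real AIME results).
samples = [
    [True, True, True, True],
    [True, True, False, True],
    [False, False, False, False],
]
score = avg_at_k(samples)
print(f"avg@4 = {score:.3f}")
```

Because avg@k averages over samples rather than taking the best one, it is stricter than pass@k: a problem solved in only one of four attempts contributes 0.25, not 1.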
- REAP pruning from 512 to 256 experts is the real story: it cuts deployment weight without giving up the sparse-MoE active-parameter profile
- FP8 beats AWQ on quality but takes a roughly 40% throughput hit, making this a practical quality-vs-latency choice for math workloads
- The included vLLM patch matters because expert routing edge cases still break real-world serving paths for unusual MoE shapes
- Fine-tuning on about 270 AIMO3 and AstralMath problems means the AIME result is impressive, but narrow and potentially sensitive to prompt placement
- Single-GPU 90%+ AIME-class math performance is exactly the kind of open-weights pressure that makes smaller, specialized reasoning models worth watching
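The single-GPU claim can be sanity-checked with back-of-envelope arithmetic on weight storage alone. This is a rough sketch under stated assumptions: AWQ is taken to be 4-bit (the source does not give the bit-width), quantization scales, KV cache, and activations are ignored, and the helper `weight_gib` is hypothetical.

```python
def weight_gib(total_params_billion, bits_per_weight):
    """Approximate weight storage in GiB for a given parameter count and precision.

    Ignores quantization metadata (scales, zero points), KV cache, and
    activations, so the real footprint is higher than this estimate.
    """
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Parameter counts from the summary; bit-widths are assumptions.
pruned_fp8 = weight_gib(64, 8)      # pruned model, 8 bits per weight
pruned_awq = weight_gib(64, 4)      # pruned model, assuming 4-bit AWQ
original_fp8 = weight_gib(120, 8)   # unpruned model at FP8, for comparison

for name, size in [
    ("64B FP8", pruned_fp8),
    ("64B AWQ (4-bit assumed)", pruned_awq),
    ("120B FP8", original_fp8),
]:
    print(f"{name}: ~{size:.1f} GiB of weights")
```

Under these assumptions the pruned FP8 weights come in just under 60 GiB, which leaves some room for KV cache on an 80 GB-class card, while the unpruned model at the same precision would not fit on one. All 64B parameters must stay resident even though only 12B are active per token, so pruning lowers the memory footprint while the per-token compute profile stays the same.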
DISCOVERED
2026-04-22
PUBLISHED
2026-04-22
AUTHOR
max6296