OPEN_SOURCE
REDDIT · 14d ago · BENCHMARK RESULT
Nemotron 3 Super splits vLLM, llama.cpp
A LocalLLaMA user reports Nemotron 3 Super scoring about 55.4% in vLLM with NVIDIA's NVFP4 checkpoint, but only about 40% in llama.cpp with GGUF quants, even after matching the obvious sampling settings. A Gemma 3 control run came out nearly identical across backends, which makes this look model-specific rather than a general runtime gap.
// ANALYSIS
This is probably a backend-fidelity problem, not a temperature problem. NVIDIA's reference setup is much more specific than a bare server launch, so a community GGUF path can drift enough to move scores.
- NVIDIA's quick start already standardizes `temperature=1.0` and `top_p=0.95`, and exposes `enable_thinking` through the chat template, so the knobs the poster tried are unlikely to explain a 15-point swing.
- The official vLLM recipe adds a custom `super_v3` reasoning parser plus `--enable-expert-parallel`, `--enable-chunked-prefill`, and Mamba cache tuning, which suggests model-specific serving logic beyond generic inference defaults.
- NVIDIA publishes BF16, FP8, and NVFP4 checkpoints with official vLLM, SGLang, TRT-LLM, and NIM guidance, while the llama.cpp run here depends on a community GGUF conversion, so this is not an apples-to-apples comparison.
- The Gemma 3 sanity check, plus Reddit speculation about a tokenizer/template mismatch, points more toward prompt serialization or reasoning-token handling than raw quantization quality.
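If reasoning-token handling is the culprit, the failure mode is easy to reproduce in a toy harness. The sketch below is hypothetical evaluation code (not NVIDIA's parser or the poster's setup): it shows how the same completion can be scored differently depending on whether `<think>` blocks are stripped before answer extraction, which is exactly the kind of drift a missing reasoning parser could cause.

```python
import re

def extract_answer(completion: str, strip_reasoning: bool) -> str:
    """Pull a multiple-choice letter from a model completion.

    If reasoning tokens are not stripped first, the regex can latch
    onto a candidate letter mentioned inside the <think> block rather
    than the model's final answer.
    """
    text = completion
    if strip_reasoning:
        # Remove the reasoning span, as a backend-side parser would.
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    m = re.search(r"answer is\s*\(?([A-D])\)?", text, flags=re.IGNORECASE)
    return m.group(1).upper() if m else ""

# Toy completion: the model muses about (C) while thinking, then answers (B).
completion = (
    "<think>Option A looks plausible, but the answer is (C)? "
    "No, re-checking the arithmetic, it must be B.</think>\n"
    "The answer is (B)."
)

print(extract_answer(completion, strip_reasoning=True))   # B
print(extract_answer(completion, strip_reasoning=False))  # C
```

Scale this mismatch across a few hundred benchmark items and a double-digit accuracy gap is plausible without any quantization-quality difference at all.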
// TAGS
nemotron-3-super · llama-cpp · vllm · benchmark · inference · reasoning · open-weights
DISCOVERED
14d ago
2026-03-28
PUBLISHED
14d ago
2026-03-28
RELEVANCE
8/10
AUTHOR
BigStupidJellyfish_