Nemotron 3 Super splits vLLM, llama.cpp
OPEN_SOURCE
REDDIT // 14d ago · BENCHMARK RESULT


A LocalLLaMA user reports Nemotron 3 Super scoring about 55.4% in vLLM with NVIDIA's NVFP4 checkpoint, but only about 40% in llama.cpp with GGUF quants, even after matching the obvious sampling settings. A Gemma 3 control run came out nearly identical across backends, which makes this look model-specific rather than a general runtime gap.

// ANALYSIS

This is probably a backend-fidelity problem, not a temperature problem. NVIDIA's reference setup is much more specific than a bare server launch, so a community GGUF path can drift enough to move scores.

  • NVIDIA's quick start already standardizes `temperature=1.0` and `top_p=0.95`, and exposes `enable_thinking` through the chat template, so the knobs the poster tried are unlikely to explain a 15-point swing.
  • The official vLLM recipe adds a custom `super_v3` reasoning parser plus `--enable-expert-parallel`, `--enable-chunked-prefill`, and Mamba cache tuning, which suggests model-specific serving logic beyond generic inference defaults.
  • NVIDIA publishes BF16, FP8, and NVFP4 checkpoints with official vLLM, SGLang, TRT-LLM, and NIM guidance, while the llama.cpp run here depends on a community GGUF conversion, so this is not an apples-to-apples comparison.
  • The Gemma 3 sanity check, plus Reddit speculation about tokenizer/template mismatch, points more toward prompt serialization or reasoning-token handling than raw quantization quality.
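One concrete way to rule out the sampling knobs before blaming the backend is to build byte-identical request payloads for both servers. A minimal sketch, assuming both vLLM and llama.cpp's server are driven through their OpenAI-compatible chat endpoints (the model names here are placeholders, and `chat_template_kwargs` is how vLLM passes `enable_thinking` through the chat template; llama.cpp's handling depends on the template embedded in the GGUF):

```python
# Sampling defaults from NVIDIA's quick start, per the analysis above.
NVIDIA_DEFAULTS = {"temperature": 1.0, "top_p": 0.95}

def build_request(model: str, prompt: str, thinking: bool = True) -> dict:
    """Build one OpenAI-style chat request with the sampling knobs pinned.

    `chat_template_kwargs` forwards `enable_thinking` to the chat template
    on vLLM; whether llama.cpp honors it depends on the GGUF's template.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        **NVIDIA_DEFAULTS,
    }

# Placeholder model names for each backend (not official identifiers).
vllm_req = build_request("nemotron-3-super-nvfp4", "What is 2+2?")
llamacpp_req = build_request("nemotron-3-super-gguf", "What is 2+2?")

# Sampling settings are identical by construction, so any remaining score
# gap must come from template serialization, tokenization, reasoning-token
# handling, or quantization quality rather than the obvious knobs.
assert all(vllm_req[k] == llamacpp_req[k] for k in NVIDIA_DEFAULTS)
```

If the gap survives this kind of controlled comparison, the next things to diff are the rendered prompt strings and token IDs each server actually feeds the model, which is where the tokenizer/template mismatch theory would show up.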
// TAGS
nemotron-3-super · llama-cpp · vllm · benchmark · inference · reasoning · open-weights

DISCOVERED

2026-03-28

PUBLISHED

2026-03-28

RELEVANCE

8/10

AUTHOR

BigStupidJellyfish_