OPEN_SOURCE
REDDIT · 13d ago · BENCHMARK RESULT
GPT-OSS 120B boosts throughput on DGX Spark
OpenAI's GPT-OSS 120B is being benchmarked on NVIDIA's DGX Spark, and the thread is really about serving-stack efficiency rather than model quality. The OP reports about 32 tps in vLLM on a Q4_K_S build, while commenters say native MXFP4 with llama.cpp should push it much closer to 50-60 tps.
// ANALYSIS
This looks more like a stack mismatch than a hard hardware ceiling. GPT-OSS 120B is sparse, open-weight, and native MXFP4, so the fastest path is usually to respect the model's format and let the runtime/kernel stack do the heavy lifting.
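As a sketch of the "respect the model's format" path, this is roughly what a native-MXFP4 llama.cpp serving invocation looks like. The model filename is hypothetical, and flag spellings should be checked against your llama.cpp build; this is a config sketch, not a verified recipe.

```shell
# Assumption: a GGUF export of GPT-OSS 120B in its native MXFP4 format,
# served with llama.cpp's llama-server. Verify flag names for your build.
llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 \
  --flash-attn \
  -c 8192
```

The key choice is skipping a re-quantization to Q4_K_S entirely: the weights stay in the format the model shipped in, and the runtime's MXFP4 kernels do the work.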
- OpenAI says GPT-OSS 120B has 117B total parameters but only 5.1B active per token, so per-token decode is dominated by memory movement and kernel efficiency.
- The thread's own numbers line up with that story: ~32 tps in vLLM/Q4_K_S, roughly 50 tps after switching to llama.cpp/MXFP4, and one reply expecting around 60 tps on DGX Spark.
- NVIDIA's DGX Spark blog says llama.cpp optimizations have lifted performance by about 35% on average, reinforcing that runtime choice is the biggest lever.
- If you care about response quality, the win is native precision plus flash attention, batching, and context tuning, not a harsher quant.
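A quick back-of-envelope check makes the "stack mismatch, not hardware ceiling" reading concrete. Assuming DGX Spark's quoted ~273 GB/s unified-memory bandwidth and ~4.25 bits per weight for MXFP4 (4-bit values plus block-scale overhead; both figures are assumptions, not from the thread), the bandwidth-bound decode ceiling for 5.1B active parameters is:

```python
# Back-of-envelope decode-rate ceiling for a sparse MoE model.
# Assumed inputs (not from the Reddit thread):
#   - DGX Spark unified memory bandwidth: ~273 GB/s
#   - MXFP4 effective size: ~4.25 bits/weight including scales
ACTIVE_PARAMS = 5.1e9          # active parameters per decoded token
BITS_PER_WEIGHT = 4.25         # MXFP4 payload + per-block scale overhead
BANDWIDTH_BPS = 273e9          # assumed memory bandwidth in bytes/s

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8   # weight bytes read per token
ceiling_tps = BANDWIDTH_BPS / bytes_per_token
print(f"memory-bandwidth ceiling: ~{ceiling_tps:.0f} tps")
```

Under these assumptions the ceiling comes out near 100 tps, so the ~50-60 tps people expect from llama.cpp/MXFP4 is a plausible real-world fraction of it, while the ~32 tps vLLM/Q4_K_S figure sits far enough below to suggest kernel or runtime overhead rather than a size problem (Q4_K_S is roughly the same bits-per-weight as MXFP4).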
// TAGS
gpt-oss-120b · dgx-spark · open-weights · inference · gpu · benchmark · llm
DISCOVERED
13d ago
2026-03-29
PUBLISHED
13d ago
2026-03-29
RELEVANCE
8/10
AUTHOR
AdamLangePL