GPT-OSS 120B boosts throughput on DGX Spark
OPEN_SOURCE
REDDIT // 13d ago · BENCHMARK RESULT


OpenAI's GPT-OSS 120B is being benchmarked on NVIDIA's DGX Spark, and the thread is really about serving-stack efficiency rather than model quality. The OP reports about 32 tps with vLLM on a Q4_K_S build, while commenters argue that running the native MXFP4 weights through llama.cpp should lift it into the 50-60 tps range.

// ANALYSIS

This looks more like a stack mismatch than a hard hardware ceiling. GPT-OSS 120B is sparse, open-weight, and native MXFP4, so the fastest path is usually to respect the model's format and let the runtime/kernel stack do the heavy lifting.
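
When comparing stacks like this, the only fair number is decode tokens/s measured the same way on both sides, with warmup excluded. A minimal, backend-agnostic sketch of such a harness (the generator callable is a stand-in for whatever streaming client you point at llama.cpp or vLLM; both expose OpenAI-compatible endpoints):

```python
import time
from typing import Callable, Iterable

def decode_tps(generate: Callable[[], Iterable[str]],
               warmup: int = 1, runs: int = 3) -> float:
    """Average decode tokens/s over several runs of a streaming generator.

    `generate` is any callable that yields tokens; in practice it would wrap
    a streaming request to the serving stack under test. It is deliberately
    backend-agnostic so the same harness times vLLM and llama.cpp alike.
    """
    for _ in range(warmup):                 # discard warmup runs (cache/graph setup)
        for _ in generate():
            pass
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += sum(1 for _ in generate())
        total_time += time.perf_counter() - start
    return total_tokens / total_time

# Usage with a dummy generator standing in for a real backend:
fake = lambda: iter(["tok"] * 128)
print(f"{decode_tps(fake):.0f} tps")        # absolute number is meaningless here
```

Measuring both stacks through the same client path like this avoids crediting one side with a faster tokenizer or HTTP layer.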

  • OpenAI says GPT-OSS 120B has 117B total parameters but only 5.1B active per token, so decode speed hinges on how efficiently the kernels stream that small active slice, not on total model size.
  • The thread's own numbers line up with that story: ~32 tps in vLLM/Q4_K_S, roughly ~50 tps after switching to llama.cpp/MXFP4, and one reply expecting around 60 tps on DGX Spark.
  • NVIDIA's DGX Spark blog says llama.cpp optimizations have lifted performance by about 35% on average, reinforcing that runtime choice is the biggest lever.
  • If you care about response quality, the win is native precision plus flash-attn, batching, and context tuning, not a harsher quant.
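
The sparsity numbers above can be sanity-checked with a memory-bandwidth roofline. Every figure in this sketch other than the 5.1B active-parameter count is an illustrative assumption (the ~4.25 bits/param MXFP4 cost and the ~273 GB/s DGX Spark bandwidth), not a measurement from the thread:

```python
# Back-of-envelope decode-throughput roofline for a sparse MoE model.
# Assumptions (illustrative, not measured):
#   - ~5.1B parameters read per generated token (OpenAI's stated active count)
#   - MXFP4 stores 4-bit values plus a shared 8-bit scale per 32-element block,
#     i.e. ~4.25 bits (~0.53 bytes) per parameter
#   - DGX Spark memory bandwidth assumed at ~273 GB/s

ACTIVE_PARAMS = 5.1e9
BYTES_PER_PARAM_MXFP4 = 4.25 / 8        # ~0.53 bytes per parameter
BANDWIDTH_BYTES_S = 273e9               # assumed, bytes/s

def roofline_tps(efficiency: float) -> float:
    """Upper-bound tokens/s if decode is purely weight-bandwidth bound."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM_MXFP4
    return efficiency * BANDWIDTH_BYTES_S / bytes_per_token

print(f"ideal roofline: {roofline_tps(1.0):.0f} tps")
print(f"at 50-60% efficiency: {roofline_tps(0.5):.0f}-{roofline_tps(0.6):.0f} tps")
```

On these assumptions the ideal roofline lands near 100 tps, so the thread's ~50-60 tps in llama.cpp/MXFP4 corresponds to roughly half the ceiling, plausible for a real serving stack, while the ~32 tps vLLM/Q4_K_S figure sits well below it, consistent with a stack mismatch rather than a hardware limit.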
// TAGS
gpt-oss-120b · dgx-spark · open-weights · inference · gpu · benchmark · llm

DISCOVERED

13d ago · 2026-03-29

PUBLISHED

13d ago · 2026-03-29

RELEVANCE

8 / 10

AUTHOR

AdamLangePL