OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.5-122B-A10B hits 80 t/s on RTX Pro 6000
A Reddit benchmark of the MXFP4_MOE quant running in llama.cpp on a single NVIDIA RTX PRO 6000 Blackwell reports roughly 80 tokens/sec for single-stream generation, 143 tokens/sec total at four concurrent requests, and about 220 ms time-to-first-token on a 512-token prompt. The results also show relatively graceful long-context degradation down to about 73 tokens/sec at 65K context, though multi-user long-context workloads become painful fast.
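The reported figures imply a sub-linear concurrency curve. A minimal back-of-envelope sketch, using only the numbers from the post (the variable names and the efficiency metric are our own framing, not the benchmark's):

```python
# Numbers reported in the Reddit benchmark:
# ~80 t/s single-stream, ~143 t/s aggregate across 4 concurrent requests.
single_stream_tps = 80.0
aggregate_tps = 143.0
n_streams = 4

# Each concurrent user sees roughly a quarter of the aggregate rate.
per_stream_tps = aggregate_tps / n_streams

# How close the aggregate comes to 4x the single-stream rate.
scaling_efficiency = aggregate_tps / (single_stream_tps * n_streams)

print(f"per-stream: {per_stream_tps:.2f} t/s")          # 35.75 t/s
print(f"efficiency vs. linear: {scaling_efficiency:.0%}")  # 45%
```

In other words, four users nearly halve each user's effective generation rate, which is consistent with the analysis below that concurrency, not single-user speed, is the binding constraint.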
// ANALYSIS
This is the kind of datapoint local inference builders actually need: not synthetic peak numbers, but a practical picture of how a 96 GB Blackwell card handles a very large MoE model under real chat-style loads.
- Single-user interactive performance looks genuinely strong, with sub-second TTFT on short prompts and around 80 t/s generation.
- The long-context story is better than expected for a 122B-class model, with only modest token-generation loss even at 65K depth.
- Concurrency is the real constraint: four short requests scale well enough for batch work, but deep-context chat collapses into waits ranging from several seconds to half a minute.
- For teams sizing on-prem inference boxes, this suggests one RTX PRO 6000 can comfortably host premium single-user or light multi-user open-weight chat, but not dense long-context shared serving.
- The benchmark also reinforces llama.cpp’s growing role as a serious production-ish local serving stack for open-weight models, not just a hobbyist runtime.
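To make the single-user numbers concrete, a simple latency model (TTFT plus generation time) can be applied to the reported rates. The 400-token reply length is a hypothetical example, and TTFT at 65K context is not reported in the post, so it is left out there:

```python
def response_latency(reply_tokens: int, gen_tps: float, ttft_s: float = 0.0) -> float:
    """Total wait for a reply = time-to-first-token + token generation time."""
    return ttft_s + reply_tokens / gen_tps

# Short-context chat: 220 ms TTFT on a 512-token prompt, ~80 t/s generation.
short_ctx = response_latency(400, 80.0, ttft_s=0.22)

# 65K context: generation drops to ~73 t/s (generation time only; long-context
# TTFT is not reported and would come on top of this).
long_ctx = response_latency(400, 73.0)

print(f"short context: {short_ctx:.2f} s")   # 5.22 s
print(f"65K context (gen only): {long_ctx:.2f} s")  # 5.48 s
```

This illustrates the card's point: the generation-rate loss at depth is modest, so the long-context pain for multiple users comes from prompt processing and queueing, not from token generation itself.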
// TAGS
qwen3.5-122b-a10b · llm · benchmark · gpu · inference
DISCOVERED
34d ago
2026-03-09
PUBLISHED
34d ago
2026-03-08
RELEVANCE
8 / 10
AUTHOR
laziz