Qwen3.5-122B-A10B hits 80 t/s on RTX Pro 6000

// 79d ago · BENCHMARK RESULT

A Reddit benchmark of the MXFP4_MOE quant running in llama.cpp on a single NVIDIA RTX PRO 6000 Blackwell reports roughly 80 tokens/sec for single-stream generation, 143 tokens/sec in aggregate across four concurrent requests, and about 220 ms time-to-first-token on a 512-token prompt. Generation speed degrades gracefully with context length, falling only to about 73 tokens/sec at 65K context, but multi-user long-context workloads slow down sharply.
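
The headline figures imply a few derived numbers worth working out. The sketch below is illustrative arithmetic only: it uses the approximate values quoted above, the function names are hypothetical, and the 400-token reply length is an assumed example rather than a number from the benchmark.

```python
# Approximate figures as quoted in the summary (not re-measured here).
SINGLE_STREAM_TPS = 80.0   # tokens/sec, one request
AGGREGATE_TPS_4 = 143.0    # tokens/sec total, four concurrent requests
LONG_CONTEXT_TPS = 73.0    # tokens/sec at 65K context
TTFT_SECONDS = 0.220       # time-to-first-token on a 512-token prompt


def per_stream_tps(aggregate_tps: float, streams: int) -> float:
    """Average generation speed each concurrent request sees."""
    return aggregate_tps / streams


def scaling_factor(aggregate_tps: float, single_tps: float) -> float:
    """Aggregate throughput relative to a single stream."""
    return aggregate_tps / single_tps


def relative_drop(baseline_tps: float, degraded_tps: float) -> float:
    """Fractional slowdown from baseline to degraded throughput."""
    return (baseline_tps - degraded_tps) / baseline_tps


def reply_latency(ttft_s: float, reply_tokens: int, tps: float) -> float:
    """Rough end-to-end time for one reply: first token plus generation."""
    return ttft_s + reply_tokens / tps


if __name__ == "__main__":
    print(f"per-stream t/s at 4 requests: {per_stream_tps(AGGREGATE_TPS_4, 4):.2f}")
    print(f"aggregate scaling vs 1 stream: {scaling_factor(AGGREGATE_TPS_4, SINGLE_STREAM_TPS):.2f}x")
    print(f"slowdown at 65K context: {relative_drop(SINGLE_STREAM_TPS, LONG_CONTEXT_TPS):.1%}")
    # 400 tokens is an assumed reply length, chosen only for illustration.
    print(f"400-token reply, single user: {reply_latency(TTFT_SECONDS, 400, SINGLE_STREAM_TPS):.2f} s")
```

On these figures, each of four concurrent streams averages about 36 t/s, aggregate throughput is roughly 1.8x that of a single stream, and generation at 65K context is about 9% slower than at short context.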

// ANALYSIS

This is the kind of datapoint local inference builders actually need: not synthetic peak numbers, but a practical picture of how a 96 GB Blackwell card handles a very large MoE model under real chat-style loads.

  • Single-user interactive performance looks genuinely strong, with sub-second TTFT on short prompts and around 80 t/s generation.
  • The long-context story is better than expected for a 122B-class model: generation slows by only about 9% (80 to 73 t/s) at 65K depth.
  • Concurrency is the real constraint: four short requests scale well enough for batch work, but deep-context chat collapses into waits ranging from several seconds to about half a minute.
  • For teams sizing on-prem inference boxes, this suggests one RTX PRO 6000 can comfortably host premium single-user or light multi-user open-weight chat, but not dense long-context shared serving.
  • The benchmark also reinforces llama.cpp’s growing role as a serious production-ish local serving stack for open-weight models, not just a hobbyist runtime.
// TAGS
qwen3.5-122b-a10b, llm, benchmark, gpu, inference

DISCOVERED

79d ago

2026-03-09

PUBLISHED

80d ago

2026-03-08

RELEVANCE

8/10

AUTHOR

laziz