OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.5-122B-A10B hits 80 t/s on RTX Pro 6000
A Reddit benchmark of the MXFP4_MOE quant running in llama.cpp on a single NVIDIA RTX PRO 6000 Blackwell reports roughly 80 tokens/sec for single-stream generation, 143 tokens/sec total at four concurrent requests, and about 220 ms time-to-first-token on a 512-token prompt. The results also show relatively graceful long-context degradation down to about 73 tokens/sec at 65K context, though multi-user long-context workloads become painful fast.
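The reported figures imply a sub-linear concurrency curve. A minimal back-of-envelope sketch, using only the numbers from the post (the variable names and the efficiency metric are our own framing, not the benchmark's):

```python
# Numbers reported in the Reddit benchmark:
# ~80 t/s single-stream, ~143 t/s aggregate across 4 concurrent requests.
single_stream_tps = 80.0
aggregate_tps = 143.0
n_streams = 4

# Each concurrent user sees roughly a quarter of the aggregate rate.
per_stream_tps = aggregate_tps / n_streams

# How close the aggregate comes to 4x the single-stream rate.
scaling_efficiency = aggregate_tps / (single_stream_tps * n_streams)

print(f"per-stream: {per_stream_tps:.2f} t/s")          # 35.75 t/s
print(f"efficiency vs. linear: {scaling_efficiency:.0%}")  # 45%
```

In other words, four users nearly halve each user's effective generation rate, which is consistent with the analysis below that concurrency, not single-user speed, is the binding constraint.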
// ANALYSIS
This is the kind of datapoint local inference builders actually need: not synthetic peak numbers, but a practical picture of how a 96 GB Blackwell card handles a very large MoE model under real chat-style loads.
- Single-user interactive performance looks genuinely strong, with sub-second TTFT on short prompts and around 80 t/s generation.
- The long-context story is better than expected for a 122B-class model, with only modest token-generation loss even at 65K depth.
- Concurrency is the real constraint: four short requests scale well enough for batch work, but deep-context chat collapses into waits ranging from several seconds to half a minute.
- For teams sizing on-prem inference boxes, this suggests one RTX PRO 6000 can comfortably host premium single-user or light multi-user open-weight chat, but not dense long-context shared serving.
- The benchmark also reinforces llama.cpp’s growing role as a serious production-ish local serving stack for open-weight models, not just a hobbyist runtime.
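To make the single-user numbers concrete, a simple latency model (TTFT plus generation time) can be applied to the reported rates. The 400-token reply length is a hypothetical example, and TTFT at 65K context is not reported in the post, so it is left out there:

```python
def response_latency(reply_tokens: int, gen_tps: float, ttft_s: float = 0.0) -> float:
    """Total wait for a reply = time-to-first-token + token generation time."""
    return ttft_s + reply_tokens / gen_tps

# Short-context chat: 220 ms TTFT on a 512-token prompt, ~80 t/s generation.
short_ctx = response_latency(400, 80.0, ttft_s=0.22)

# 65K context: generation drops to ~73 t/s (generation time only; long-context
# TTFT is not reported and would come on top of this).
long_ctx = response_latency(400, 73.0)

print(f"short context: {short_ctx:.2f} s")   # 5.22 s
print(f"65K context (gen only): {long_ctx:.2f} s")  # 5.48 s
```

This illustrates the card's point: the generation-rate loss at depth is modest, so the long-context pain for multiple users comes from prompt processing and queueing, not from token generation itself.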
// TAGS
qwen3.5-122b-a10b · llm · benchmark · gpu · inference
DISCOVERED
34d ago
2026-03-09
PUBLISHED
34d ago
2026-03-08
RELEVANCE
8 / 10
AUTHOR
laziz