OPEN_SOURCE
REDDIT · BENCHMARK RESULT
Qwen3.5 MoE hits 9.5 tok/s on Strix Halo
An r/LocalLLaMA user is trying to spread Qwen3.5-122B-A10B across two 128GB Strix Halo nodes in a k8s cluster with expert parallelism and says the setup reaches 9.5 tok/s. They’re now profiling bottlenecks and considering ROCm kernels, but the real question is whether the complexity beats a simpler parallelism strategy.
// ANALYSIS
Cool experiment, but this reads more like a topology lesson than a throughput win. On a sparse MoE model, EP only pays if cross-node traffic stays tame, and consumer APUs usually expose the pain fast.
- The official Qwen3.5 card describes the model as a 122B-parameter MoE with 256 experts and 8 routed + 1 shared active per token, so routing overhead is baked into the problem.
- Qwen's own serving guidance leans on SGLang or vLLM with 8-way tensor parallel, which suggests the default high-performance path is still a mature serving stack, not bespoke cluster choreography.
- Strix Halo's 128GB unified memory is what makes these experiments possible, but unified memory does not erase bandwidth and interconnect ceilings.
- One commenter in the thread says a single 128GB Strix Halo can already hit roughly 23-25 tok/s on the same model/quant, so 9.5 tok/s across two machines looks more like an early prototype than a scaling win.
- Before jumping to custom ROCm kernels, I'd profile whether the bottleneck is routing, memory copies, or scheduler overhead; that answer will tell you whether EP, pipeline parallelism, or a dense-model baseline is the real move.
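The single-node vs. two-node numbers can be sanity-checked with a quick bandwidth roofline. This is a sketch under loud assumptions: ~10B active parameters per token (the "A10B" in the model name), a 4-bit quant (~0.5 bytes/param), and ~256 GB/s of LPDDR5X bandwidth for Strix Halo; none of these are measurements from the thread.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound MoE.
# Assumptions (not measured): ~10B active params/token, 4-bit quant,
# ~256 GB/s memory bandwidth on a single Strix Halo node.

def decode_ceiling_toks(active_params_b: float,
                        bytes_per_param: float,
                        mem_bw_gbs: float) -> float:
    """Upper bound on decode tok/s if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bw_gbs * 1e9 / bytes_per_token

ceiling = decode_ceiling_toks(10, 0.5, 256)
print(f"~{ceiling:.0f} tok/s naive ceiling")  # ~51 tok/s
```

Under these assumptions, the reported 23-25 tok/s on one node is already around half the naive ceiling (reasonable once activations, KV cache, and kernel overhead are counted), while 9.5 tok/s across two nodes sits well below one node's roofline, which points at interconnect and routing cost rather than memory bandwidth.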
// TAGS
qwen3.5-122b-a10b · strix-halo · llm · inference · gpu · benchmark · self-hosted · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8 / 10
AUTHOR
hortasha