Qwen3.5 MoE hits 9.5 tok/s on Strix Halo
OPEN_SOURCE · REDDIT · 20d ago · BENCHMARK RESULT


An r/LocalLLaMA user is trying to spread Qwen3.5-122B-A10B across two 128GB Strix Halo nodes in a k8s cluster with expert parallelism and says the setup reaches 9.5 tok/s. They’re now profiling bottlenecks and considering ROCm kernels, but the real question is whether the complexity beats a simpler parallelism strategy.
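To make the routing mechanics concrete, here is a minimal top-k gating sketch of the kind of MoE routing at play. The 256-expert / 8-routed figures come from the model card cited below; everything else (the toy hidden size, the linear gate, the even split of experts across two nodes) is an illustrative assumption, not Qwen's actual router.

```python
import numpy as np

# Minimal top-k MoE gating sketch (generic, NOT Qwen's actual router).
# A linear gate scores all experts, the top 8 are kept, and their softmax
# weights mix the expert outputs. With experts sharded across two nodes,
# any token whose top-8 includes remote experts triggers cross-node traffic.

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K = 256, 8   # from the Qwen3.5 model card
HIDDEN = 64                   # toy size for illustration only

token = rng.standard_normal(HIDDEN)
gate_w = rng.standard_normal((NUM_EXPERTS, HIDDEN))

scores = gate_w @ token
top = np.argsort(scores)[-TOP_K:]            # indices of the 8 routed experts
weights = np.exp(scores[top] - scores[top].max())
weights /= weights.sum()                     # softmax over selected experts

# Assume node 0 hosts experts 0..127 and node 1 hosts 128..255.
remote = int((top >= NUM_EXPERTS // 2).sum())
print(f"routed experts: {sorted(top.tolist())}, {remote} on the remote node")
```

With a roughly even shard, about half of each token's routed experts land on the other node, which is exactly why expert placement and routing locality dominate the two-box setup described above.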

// ANALYSIS

Cool experiment, but this reads more like a topology lesson than a throughput win. On a sparse MoE model, expert parallelism only pays off if cross-node traffic stays tame, and consumer APU interconnects usually expose the pain fast.

  • The official Qwen3.5 card describes the model as a 122B-parameter MoE with 256 experts and 8 routed + 1 shared active per token, so routing overhead is baked into the problem.
  • Qwen's own serving guidance leans on SGLang or vLLM with 8-way tensor parallel, which suggests the default high-performance path is still a mature serving stack, not bespoke cluster choreography.
  • Strix Halo's 128GB unified memory is what makes these experiments possible, but unified memory does not erase bandwidth and interconnect ceilings.
  • One commenter in the thread says a single 128GB Strix Halo can already hit roughly 23-25 tok/s on the same model/quant, so 9.5 tok/s across two machines looks more like an early prototype than a scaling win.
  • Before jumping to custom ROCm kernels, I'd profile whether the bottleneck is routing, memory copies, or scheduler overhead; that answer will tell you whether EP, pipeline parallelism, or a dense-model baseline is the real move.
// TAGS
qwen3.5-122b-a10b · strix-halo · llm · inference · gpu · benchmark · self-hosted · open-weights

DISCOVERED

2026-03-23 (20d ago)

PUBLISHED

2026-03-23 (20d ago)

RELEVANCE

8 / 10

AUTHOR

hortasha