OPEN_SOURCE
REDDIT · BENCHMARK RESULT
Qwen3.5 MoE hits 9.5 tok/s on Strix Halo
An r/LocalLLaMA user is trying to spread Qwen3.5-122B-A10B across two 128GB Strix Halo nodes in a k8s cluster with expert parallelism and says the setup reaches 9.5 tok/s. They’re now profiling bottlenecks and considering ROCm kernels, but the real question is whether the complexity beats a simpler parallelism strategy.
// ANALYSIS
Cool experiment, but this reads more like a topology lesson than a throughput win. On a sparse MoE model, EP only pays if cross-node traffic stays tame, and consumer APUs usually expose the pain fast.
- The official Qwen3.5 card describes the model as a 122B-parameter MoE with 256 experts and 8 routed + 1 shared active per token, so routing overhead is baked into the problem.
- Qwen's own serving guidance leans on SGLang or vLLM with 8-way tensor parallel, which suggests the default high-performance path is still a mature serving stack, not bespoke cluster choreography.
- Strix Halo's 128GB unified memory is what makes these experiments possible, but unified memory does not erase bandwidth and interconnect ceilings.
- One commenter in the thread says a single 128GB Strix Halo can already hit roughly 23-25 tok/s on the same model/quant, so 9.5 tok/s across two machines looks more like an early prototype than a scaling win.
- Before jumping to custom ROCm kernels, I'd profile whether the bottleneck is routing, memory copies, or scheduler overhead; that answer will tell you whether EP, pipeline parallelism, or a dense-model baseline is the real move.
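The single-node vs. two-node numbers can be sanity-checked with a quick bandwidth roofline. This is a sketch under loud assumptions: ~10B active parameters per token (the "A10B" in the model name), a 4-bit quant (~0.5 bytes/param), and ~256 GB/s of LPDDR5X bandwidth for Strix Halo; none of these are measurements from the thread.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound MoE.
# Assumptions (not measured): ~10B active params/token, 4-bit quant,
# ~256 GB/s memory bandwidth on a single Strix Halo node.

def decode_ceiling_toks(active_params_b: float,
                        bytes_per_param: float,
                        mem_bw_gbs: float) -> float:
    """Upper bound on decode tok/s if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bw_gbs * 1e9 / bytes_per_token

ceiling = decode_ceiling_toks(10, 0.5, 256)
print(f"~{ceiling:.0f} tok/s naive ceiling")  # ~51 tok/s
```

Under these assumptions, the reported 23-25 tok/s on one node is already around half the naive ceiling (reasonable once activations, KV cache, and kernel overhead are counted), while 9.5 tok/s across two nodes sits well below one node's roofline, which points at interconnect and routing cost rather than memory bandwidth.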
// TAGS
qwen3.5-122b-a10b · strix-halo · llm · inference · gpu · benchmark · self-hosted · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8 / 10
AUTHOR
hortasha