Qwen3.5 MoE hits 9.5 tok/s on Strix Halo

// 79d ago · BENCHMARK RESULT

An r/LocalLLaMA user is splitting Qwen3.5-122B-A10B across two 128GB Strix Halo nodes in a Kubernetes (k8s) cluster using expert parallelism (EP) and reports 9.5 tok/s. They’re now profiling bottlenecks and considering custom ROCm kernels, but the real question is whether the added complexity beats a simpler parallelism strategy.

// ANALYSIS

Cool experiment, but this reads more like a topology lesson than a throughput win. On a sparse MoE model, expert parallelism only pays off if cross-node traffic stays low, and consumer APUs usually expose that cost quickly.

  • The official Qwen3.5 card describes the model as a 122B-parameter MoE with 256 experts and 8 routed + 1 shared active per token, so routing overhead is baked into the problem.
  • Qwen's own serving guidance leans on SGLang or vLLM with 8-way tensor parallelism, which suggests the default high-performance path is still a mature serving stack, not bespoke cluster choreography.
  • Strix Halo's 128GB unified memory is what makes these experiments possible, but unified memory does not erase bandwidth and interconnect ceilings.
  • One commenter in the thread says a single 128GB Strix Halo can already hit roughly 23-25 tok/s on the same model/quant, so 9.5 tok/s across two machines looks more like an early prototype than a scaling win.
  • Before jumping to custom ROCm kernels, I'd profile whether the bottleneck is routing, memory copies, or scheduler overhead; that answer will tell you whether EP, pipeline parallelism, or a dense-model baseline is the real move.
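
To make the cross-node traffic and scaling points above concrete, here is a rough back-of-the-envelope sketch in Python. It is not the poster's setup: the contiguous expert placement, the uniform random routing, and every function name are assumptions made for illustration. Real routers are learned and skewed, and the shared expert could simply be replicated on each node. Only the 256-expert and 8-routed figures and the tok/s numbers come from the text above.

```python
import random

NUM_EXPERTS = 256  # routed experts, per the model card figure quoted above
TOP_K = 8          # routed experts active per token
NUM_NODES = 2      # two Strix Halo nodes, as in the post


def expert_to_node(expert_id: int) -> int:
    """Hypothetical placement: equal contiguous blocks of experts per node."""
    return expert_id * NUM_NODES // NUM_EXPERTS


def remote_fraction(num_tokens: int, home_node: int = 0, seed: int = 0) -> float:
    """Share of routed expert calls that land off the home node.

    Assumes uniform random top-k routing, which real models do not guarantee.
    """
    rng = random.Random(seed)
    remote = 0
    for _ in range(num_tokens):
        chosen = rng.sample(range(NUM_EXPERTS), TOP_K)  # k distinct experts
        remote += sum(1 for e in chosen if expert_to_node(e) != home_node)
    return remote / (num_tokens * TOP_K)


def relative_throughput(cluster_tps: float, single_node_tps: float) -> float:
    """Cluster throughput as a fraction of the single-node figure."""
    return cluster_tps / single_node_tps


if __name__ == "__main__":
    print(f"remote expert calls: {remote_fraction(2000):.2%}")
    for single in (23.0, 25.0):  # single-node range reported by the commenter
        ratio = relative_throughput(9.5, single)
        print(f"9.5 tok/s vs {single} tok/s single node: {ratio:.2f}x")
```

Under these assumptions about half of all routed expert calls cross the interconnect on every MoE layer, and the two-node result comes out at roughly 0.38x to 0.41x of the reported single-node figure. That is why profiling the routing and copy path comes before writing kernels.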
// TAGS
qwen3.5-122b-a10b, strix-halo, llm, inference, gpu, benchmark, self-hosted, open-weights

DISCOVERED: 79d ago (2026-03-23)
PUBLISHED: 79d ago (2026-03-23)
RELEVANCE: 8/10
AUTHOR: hortasha