OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.5-122B-A10B hits 30 tps on M1 Ultra
A Reddit user ran Qwen3.5-122B-A10B in Unsloth Studio with a llama.cpp backend on an M1 Ultra, feeding it a 22K-token TurboQuant paper. They reported 396 tps prompt processing and 30.5 tps token generation, which is a useful signal that this giant MoE model is locally usable on Apple Silicon with the right quantization.
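To put the reported figures in perspective, here is a back-of-the-envelope latency calculation using the numbers from the post (396 tps prefill, 30.5 tps decode, 22K-token prompt); the 500-token response length is an assumption for illustration:

```python
# Rough interactive-latency estimate from the reported M1 Ultra numbers.
PREFILL_TPS = 396.0       # prompt processing speed (from the post)
DECODE_TPS = 30.5         # token generation speed (from the post)
PROMPT_TOKENS = 22_000    # the TurboQuant paper fed as context
RESPONSE_TOKENS = 500     # hypothetical answer length

time_to_first_token = PROMPT_TOKENS / PREFILL_TPS   # ~55.6 s
generation_time = RESPONSE_TOKENS / DECODE_TPS      # ~16.4 s

print(f"prefill: {time_to_first_token:.1f} s")
print(f"decode:  {generation_time:.1f} s")
print(f"total:   {time_to_first_token + generation_time:.1f} s")
```

In other words, even with "healthy" prefill, a full 22K-token paper still costs roughly a minute before the first output token, which is the practical cost of long-context use at these speeds.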
// ANALYSIS
This reads less like a one-off curiosity and more like evidence that Qwen3.5’s 10B-active MoE design can make very large models practical on consumer Macs. The real story is the decode speed: 30.5 tps is not “desktop toy” territory for a 122B model.
- The 396 tps prompt speed suggests prefill is very healthy, even with a 22K-token context, so long-context workflows may be more usable than people expect.
- The 30.5 tps generation rate is the number that matters for interactive use, and it's respectable for a local 122B-class model on Apple Silicon.
- llama.cpp via Unsloth Studio is a solid baseline, but MLX is the obvious next comparison point for Mac owners chasing more throughput.
- This is still a single user report, so quant choice, sampler settings, and context length could move the numbers a lot.
- The broader takeaway: Qwen3.5-122B-A10B is no longer automatically "too big for Mac," but it remains highly setup-sensitive.
// TAGS
qwen3-5-122b-a10b · llm · benchmark · inference · self-hosted · open-weights
DISCOVERED
2026-04-01
PUBLISHED
2026-04-01
RELEVANCE
8/10
AUTHOR
One_Key_8127