OPEN_SOURCE
REDDIT · 4h ago · MODEL RELEASE
Qwen3.5-122B hits performance ceiling on Apple Silicon
A LocalLLaMA user reports a consistent ~10 tok/s for the Qwen3.5-122B-A10B MoE model on high-end M4 Max and M1 Ultra hardware. Despite exhaustive llama.cpp configuration tweaks, memory bandwidth remains the primary bottleneck for this 122B-parameter model.
// ANALYSIS
Qwen3.5-122B-A10B is the new heavyweight champion for local inference, but it demands specific software stacks to shine.
- ~10 tok/s on llama.cpp is the expected floor for a model of this scale; MLX is required to hit the 40+ tok/s ceiling on M4 Max.
- Performance degradation at 50k+ context points to KV cache overhead and memory pressure, common in MoE models with large context windows.
- 128GB of Unified Memory is the minimum requirement for 4-bit quants; any higher precision or context quickly triggers memory swap.
- Users seeking interactive coding speeds should pivot to the 27B dense variant or prioritize the MLX framework over traditional GGUF backends.
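The bandwidth math behind these numbers can be sketched as a roofline estimate. All figures below (bandwidth, active parameter count, KV-cache geometry) are illustrative assumptions, not confirmed specs of Qwen3.5-122B-A10B or of Apple's hardware:

```python
# Back-of-envelope roofline for a memory-bandwidth-bound MoE decode step.
# All numeric figures are illustrative assumptions, not confirmed specs.

def weights_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB at a given quantization width."""
    return params_b * 1e9 * bits / 8 / 1e9

def roofline_tok_s(active_params_b: float, bits: int, bandwidth_gb_s: float) -> float:
    """Upper bound on decode tok/s: every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """fp16 KV cache size: two tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# 122B total parameters at 4-bit: ~61 GB of weights alone, which is why
# 128GB of Unified Memory is a comfortable floor rather than a luxury.
print(f"4-bit weights: {weights_gb(122, 4):.0f} GB")

# ~10B active parameters per token at 4-bit on an M4 Max (assuming ~546 GB/s):
# the theoretical ceiling is ~109 tok/s, so the observed 10-40 tok/s range
# reflects framework overhead, expert routing, and imperfect bandwidth use.
print(f"roofline ceiling: {roofline_tok_s(10, 4, 546):.0f} tok/s")

# Hypothetical geometry (60 layers, 8 KV heads, head_dim 128) at 50k context:
# the fp16 cache adds ~12 GB on top of the weights, consistent with the
# degradation and swap pressure reported at long contexts.
print(f"KV cache @ 50k ctx: {kv_cache_gb(60, 8, 128, 50_000):.1f} GB")
```

The gap between the ~109 tok/s bandwidth ceiling and the ~10 tok/s reported on llama.cpp is what makes switching to a better-optimized backend like MLX worthwhile before reaching for hardware upgrades.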
// TAGS
qwen3.5-122b-a10b · llm · inference · benchmark · open-weights
DISCOVERED
2026-04-15
PUBLISHED
2026-04-14
RELEVANCE
8/10
AUTHOR
lots_of_apples