Qwen3.5-122B hits performance ceiling on Apple Silicon
OPEN_SOURCE ↗
REDDIT // 4h ago · MODEL RELEASE


A LocalLLaMA user reports a consistent 10 tok/s for the Qwen3.5-122B-A10B MoE model on high-end M4 Max and M1 Ultra machines. Despite exhaustive llama.cpp configuration tweaks, memory bandwidth remains the primary bottleneck for the 122B-parameter model.
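A quick roofline sanity check puts the report in context. Decode on these machines is memory-bandwidth bound: each generated token requires reading every active weight once, so tokens/s is capped at bandwidth divided by bytes read per token. The sketch below assumes the M1 Ultra's 800 GB/s unified-memory bandwidth, ~10B active parameters per token (the A10B in the model name), and an effective ~4.5 bits/weight for a Q4-class quant; those last two figures are rough assumptions, not measurements from the post.

```python
# Back-of-envelope roofline for memory-bandwidth-bound MoE decode.
# Assumption: only the ~10B active parameters are read per token,
# quantized to an effective ~4.5 bits/weight (Q4_K-style overhead).

def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s: one full pass over active weights per token."""
    gb_per_token = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / gb_per_token

# M1 Ultra: 800 GB/s unified-memory bandwidth (Apple's published spec)
ceiling = decode_ceiling_tok_s(800, 10, 4.5)
print(f"theoretical ceiling: {ceiling:.0f} tok/s")  # ~142 tok/s
```

The observed 10 tok/s sits far below this ceiling, which is consistent with the analysis below: the backend, not raw bandwidth alone, leaves substantial headroom that MLX apparently recovers.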

// ANALYSIS

Qwen3.5-122B-A10B is the new heavyweight champion for local inference, but it demands specific software stacks to shine.

  • 10 tok/s on llama.cpp is the expected floor for a model of this scale; MLX is required to hit the 40+ tok/s ceiling on M4 Max.
  • Performance degradation at 50k+ context points to KV cache overhead and memory pressure, common in MoE models with large context windows.
  • 128GB of Unified Memory is the practical minimum for 4-bit quants; any higher precision or longer context quickly pushes the system into swap.
  • Users seeking interactive coding speeds should pivot to the 27B dense variant or prioritize the MLX framework over traditional GGUF backends.
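The KV-cache pressure mentioned in the second bullet can be sanity-checked with the standard cache-size formula (2 tensors per layer, keys and values, each shaped [context, kv_heads, head_dim]). The layer and head counts below are illustrative placeholders, not the published Qwen3.5-122B-A10B configuration, which the post does not state.

```python
# Rough KV-cache footprint at long context. Dimensions are hypothetical
# stand-ins for a large GQA transformer, NOT the real Qwen3.5-122B config.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Keys + values: 2 tensors per layer, each ctx * n_kv_heads * head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dims: 60 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
gib = kv_cache_bytes(60, 8, 128, 50_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache at 50k context")
```

Even with aggressive grouped-query attention, tens of gigabytes of cache on top of the quantized weights explains why 50k+ contexts push a 128GB machine toward swap.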
// TAGS
qwen3.5-122b-a10b · llm · inference · benchmark · open-weights

DISCOVERED

2026-04-15

PUBLISHED

2026-04-14

RELEVANCE

8/10

AUTHOR

lots_of_apples