OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Mac users see Qwen3.5 GGUF outrun MLX
A LocalLLaMA user with an M3 Ultra Mac Studio (512GB) reports much faster prompt processing and steadier token generation with Qwen3.5 GGUF models in llama.cpp than with MLX on long-context, agentic coding tasks. The post adds that llama.cpp prompt caching feels more reliable in real multi-file workflows, and asks the community for corrections and better tuning advice.
// ANALYSIS
This reads less like “MLX is bad” and more like a practical warning that long-context runtime behavior matters more than peak tokens-per-second claims.
- The benchmark scenario is developer-realistic (multi-file coding, debugging, MCP/tool calls), where prefill speed and cache reuse dominate perceived responsiveness.
- Recent llama.cpp hybrid-cache updates (including checkpointing controls) indicate rapid iteration on Qwen3.5 long-context pain points.
- Some full reprocessing behavior appears linked to hybrid/recurrent-memory constraints and changing prompt prefixes, so client prompt construction can materially affect results.
- For Mac workflows, a two-model strategy (a faster 35B for iteration, a larger 122B for final quality) is emerging as a pragmatic pattern.
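The prompt-prefix point above can be sketched with a toy example (the helper and sample prompts are hypothetical, and whitespace splitting stands in for a real tokenizer): prefix-based prompt caching of the kind llama.cpp uses can only reuse cached state up to the first token that differs between requests, so a client that injects a volatile header (timestamp, reordered file list) at the top of every prompt forfeits nearly all cache reuse.

```python
# Sketch: why a changing prompt prefix defeats KV-cache reuse.
# Prefix-caching engines reuse cached state only for the longest
# shared leading-token run between the previous and current prompt.

def reusable_prefix_len(prev_tokens, cur_tokens):
    """Number of leading tokens identical in both prompts."""
    n = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy "tokenization": whitespace split stands in for a real tokenizer.
stable_system = "You are a coding assistant . Project files : main.py utils.py".split()
turn1 = stable_system + "User : fix the bug in utils.py".split()
turn2 = stable_system + "User : now add a unit test".split()

# Stable prefix: everything up to the diverging user text is reusable.
print(reusable_prefix_len(turn1, turn2))  # → 13

# A volatile header (e.g. a timestamp) at position 0 kills reuse.
turn2_volatile = "[ 2026-03-17T12:00 ]".split() + turn2
print(reusable_prefix_len(turn1, turn2_volatile))  # → 0
```

This is why keeping system prompts, tool schemas, and file listings byte-stable across turns matters more for long-context responsiveness than raw tokens-per-second.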
// TAGS
qwen3.5 · llm · inference · benchmark · ai-coding · mcp · self-hosted · open-source
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
8/10
AUTHOR
BitXorBit