OPEN_SOURCE
REDDIT // 9d ago · BENCHMARK RESULT
Qwen 3.5 MoE tops Gemma 4 M5 benchmarks
Performance benchmarks on a MacBook with the M5 chip (128GB RAM), run with the oMLX framework, show that Qwen 3.5 MoE remains the throughput leader for local agentic workloads despite Gemma 4's gains in responsiveness. The results highlight both the M5's new Neural Accelerator, which delivers up to 4x faster prompt processing, and oMLX's tiered KV caching, which cuts latency for long-context multi-turn interactions.
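A quick back-of-the-envelope model shows why prefill speed and KV caching dominate responsiveness for long contexts. The token rates below are the figures reported in the post; the context and output sizes are illustrative assumptions, not measurements:

```python
# Latency model built on the post's reported rates for Qwen 3.5 MoE on M5.
PREFILL_TOK_S = 2850   # prompt-processing rate (reported)
GEN_TOK_S = 92.2       # generation rate (reported)

def turn_latency(context_tokens: int, output_tokens: int,
                 cached_prefix_tokens: int = 0) -> float:
    """Seconds for one turn: prefill the uncached context, then generate."""
    uncached = max(context_tokens - cached_prefix_tokens, 0)
    return uncached / PREFILL_TOK_S + output_tokens / GEN_TOK_S

# Hypothetical 50k-token agentic context with a 500-token reply:
cold = turn_latency(50_000, 500)          # ~23s, prefill-dominated (~17.5s)
warm = turn_latency(50_000, 500,
                    cached_prefix_tokens=49_000)  # ~5.8s with a cached prefix
```

Even at nearly 2,850 tok/s, cold prefill of a large context takes tens of seconds, which is why restoring a cached prefix from SSD in under 2 seconds changes the interactive feel of multi-turn sessions.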
// ANALYSIS
The M5 Max and oMLX are turning local Macs into viable high-performance inference servers, with MoE architectures clearly winning on Apple Silicon.
- Qwen 3.5 MoE (35B-A3B) is the current performance champion, achieving 92.2 tok/s for generation and nearly 2,850 tok/s for prompt processing.
- oMLX's tiered KV caching leverages SSD storage to restore context prefixes in under 2 seconds, a massive improvement over the 60+ second prefill times seen in standard MLX implementations.
- The M5's Neural Accelerator specifically boosts the prefill stage, making dense models more responsive but not yet competitive with MoE throughput.
- While Gemma 4 is more memory-efficient and responsive for "edge" tasks, it lags behind Qwen in sustained batching and serving performance for heavy developer workloads.
- SSD-based context persistence is becoming the new baseline for "agentic" local LLM tools like Claude Code and Cursor.
// TAGS
omlx · mlx · apple-silicon · llm · benchmark · gemma-4 · qwen3-5 · edge-ai
DISCOVERED
9d ago
2026-04-03
PUBLISHED
9d ago
2026-04-02
RELEVANCE
8/10
AUTHOR
onil_gova