OPEN_SOURCE ↗
REDDIT // 1h ago · BENCHMARK RESULT
oMLX oQ rescues aging M1 Max
Updating to oMLX 0.3.6 and redownloading oQ-quantized models reportedly fixed prefill timeouts on a Qwen3.5 30B A3B 4-bit setup running on an M1 Max with a 24-core GPU. The poster also points to DFlash, a new decoder-speed feature, as the next likely leap for local coding workflows.
// ANALYSIS
This is the kind of performance win that actually changes how people use local models, not just a nice benchmark bump. If the numbers hold beyond one machine, oMLX is becoming a serious Apple Silicon backend for agentic coding by attacking the two pain points that matter most: prefill latency and cache churn.
- The key signal is prefill: Claude Code timing out usually means the server cannot absorb long contexts fast enough, which makes local inference feel unusable even when decode speed is acceptable.
- oQ-quantized models look like the immediate practical improvement here; DFlash is promising, but the post explicitly says it has not been tested yet.
- The 32k benchmark context matters because agent workflows live in long-context territory, where repeated recomputation hurts the most.
- This is less about raw model quality and more about turning a marginal Mac into something steady enough for daily local coding use.
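Why prefill, rather than decode, is the bullet list's "key signal" can be made concrete with a back-of-envelope calculation. All throughput and length figures below are illustrative assumptions, not measurements from the post:

```python
# Rough sketch of why slow prefill causes client timeouts on long agent
# contexts even when streaming speed feels fine. Numbers are hypothetical.

CONTEXT_TOKENS = 32_000   # benchmark context size from the post
PREFILL_TOK_S = 250.0     # assumed prefill throughput (tokens/s)
DECODE_TOK_S = 40.0       # assumed decode throughput (tokens/s)
OUTPUT_TOKENS = 480       # assumed length of a coding-agent reply

# Prefill must process the ENTIRE context before the first output token,
# so its cost scales with context length; decode is paid per output token.
prefill_s = CONTEXT_TOKENS / PREFILL_TOK_S
decode_s = OUTPUT_TOKENS / DECODE_TOK_S

print(f"time to first token: {prefill_s:.0f}s")  # 128s: past most client timeouts
print(f"time to stream reply: {decode_s:.0f}s")  # 12s: acceptable once started
```

Under these assumptions the user waits over two minutes before any output appears, which is exactly the failure mode a client like Claude Code surfaces as a timeout, and it recurs whenever the cache churns and the context must be recomputed.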
// TAGS
omlx · inference · gpu · benchmark · agent · cli · open-source
DISCOVERED
1h ago
2026-04-17
PUBLISHED
3h ago
2026-04-17
RELEVANCE
8/10
AUTHOR
fisherwei