oMLX oQ rescues aging M1 Max
OPEN_SOURCE
REDDIT // 1h ago · BENCHMARK RESULT


Updating to oMLX 0.3.6 and redownloading oQ-quantized models reportedly fixed prefill timeouts for a Qwen3.5 30B A3B 4-bit setup running on an M1 Max with a 24-core GPU. The poster also points to DFlash, a new decode-speed feature, as the next likely leap for local coding workflows.

// ANALYSIS

This is the kind of performance win that actually changes how people use local models, not just a nice benchmark bump. If the numbers hold beyond one machine, it suggests oMLX is becoming a serious Apple Silicon backend for agentic coding by attacking the two pain points that matter most: prefill latency and cache churn.

  • The key signal is prefill: Claude Code timing out usually means the server cannot absorb long contexts fast enough, which makes local inference feel unusable even when decode speed is acceptable.
  • oQ-quantized models look like the immediate practical improvement here; DFlash is promising, but the post explicitly says it has not been tested yet.
  • The 32k benchmark context matters because agent workflows live in long-context territory, where repeated recomputation hurts the most.
  • This is less about raw model quality and more about turning a marginal Mac into something steady enough for daily local coding use.
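Why prefill, not decode, is what trips client timeouts can be seen with a simple two-phase latency model: prefill cost grows with the whole prompt, while decode pays a roughly fixed cost per generated token. This is an illustrative sketch, not oMLX measurements; the throughput numbers and the `request_latency` helper are hypothetical.

```python
# Back-of-the-envelope model (hypothetical numbers, not oMLX benchmarks):
# prefill processes the entire prompt in one batched pass, so its cost
# scales with context length; decode is roughly constant per output token.

def request_latency(prompt_tokens: int, output_tokens: int,
                    prefill_tok_per_s: float = 300.0,
                    decode_tok_per_s: float = 40.0) -> float:
    """Seconds until the full response under a simple two-phase model."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    return prefill_s + decode_s

# At agent-scale contexts, prefill dominates: a 32k-token prompt alone
# costs ~107 s of prefill at 300 tok/s, easily tripping client timeouts,
# even though decode speed is identical in both cases.
short = request_latency(2_000, 500)
long_ctx = request_latency(32_000, 500)
print(f"2k ctx: {short:.1f}s, 32k ctx: {long_ctx:.1f}s")
```

Under this model, repeatedly recomputing a long prompt (the cache churn the analysis mentions) multiplies the dominant prefill term on every turn, which is why prompt-cache reuse matters so much more than raw decode speed for agent workflows.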
// TAGS
omlx · inference · gpu · benchmark · agent · cli · open-source

DISCOVERED

1h ago

2026-04-17

PUBLISHED

3h ago

2026-04-17

RELEVANCE

8 / 10

AUTHOR

fisherwei