OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT
oMLX batching gains fade with long context
The post argues that batching only delivers big speedups at short context lengths, while 8k-32k prompts on a base M4 show much smaller gains or none at all. The likely reason is that long-context decode is dominated by cache traffic and per-request overhead, so batching cannot amortize weight reads the way it does on shorter prompts.
// ANALYSIS
The surprise is mostly in the expectation, not the result: batching helps when the runtime can keep the machine busy with shared work, but long-context inference quickly becomes a memory- and cache-bound problem where extra concurrency adds less benefit.
- oMLX's own benchmarks show strong batching gains at 1k contexts, but the speedup shrinks as context grows and token-generation throughput falls sharply at 16k–32k.
- If requests do not share a prefix, each one still pays its own long prefill and KV-cache costs, so "batching" looks much less like reuse and more like parallel contention.
- When prompts are identical, reuse is easier and batching looks better; when they diverge, the runtime loses the chance to amortize the expensive early tokens.
- On a base M4 with 16GB unified memory, thermal limits and memory pressure can flatten gains before the theoretical compute-vs-bandwidth tradeoff matters.
- For inference-machine buying decisions, sustained memory bandwidth, KV-cache behavior, and long-context batching efficiency matter more than peak TFLOPS on paper.
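The amortization argument above can be sketched with a rough memory-traffic model: in a bandwidth-bound decode step, weights are read once and shared across the batch, while each request streams its own KV cache every step. The numbers below (a ~4 GB 4-bit model, ~128 KB of KV cache per token) are illustrative assumptions, not oMLX measurements.

```python
# Rough decode-step memory-traffic model (assumption: decode is
# bandwidth-bound, so bytes moved per token approximates cost).
def bytes_per_token(weight_bytes, kv_bytes_per_token, context, batch):
    # Weights are read once per batched step, amortized across requests;
    # each request must still stream its own KV cache for the full context.
    weight_share = weight_bytes / batch
    kv_traffic = kv_bytes_per_token * context
    return weight_share + kv_traffic

W = 4e9        # hypothetical 8B-class model quantized to ~4 GB
KV = 131072    # hypothetical ~128 KB of KV cache per token

# Per-token traffic with batch=8 relative to batch=1, short vs. long context.
short = bytes_per_token(W, KV, 1024, 8) / bytes_per_token(W, KV, 1024, 1)
long_ = bytes_per_token(W, KV, 32768, 8) / bytes_per_token(W, KV, 32768, 1)
print(f"batch=8 traffic vs batch=1: 1k ctx -> {short:.2f}, 32k ctx -> {long_:.2f}")
```

Under these assumptions, batching cuts per-token traffic to roughly 15% of the unbatched cost at 1k context but only to roughly 58% at 32k, because per-request KV-cache reads dominate and cannot be amortized, which matches the shape of the reported results.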
// TAGS
benchmark · inference · llm · self-hosted · omlx
DISCOVERED
3h ago
2026-04-17
PUBLISHED
5h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
Seetie_AI