OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT
oMLX batching gains fade with long context
The post argues that batching only delivers big speedups at short context lengths, while 8k-32k prompts on a base M4 show much smaller gains or none at all. The likely reason is that long-context decode is dominated by cache traffic and per-request overhead, so batching cannot amortize weight reads the way it does on shorter prompts.
// ANALYSIS
The surprise is mostly in the expectation, not the result: batching helps when the runtime can keep the machine busy with shared work, but long-context inference quickly becomes a memory- and cache-bound problem where extra concurrency adds less benefit.
- oMLX's own benchmarks show strong batching gains at 1k contexts, but the speedup shrinks as context grows and token-generation throughput falls sharply at 16k–32k.
- If requests do not share a prefix, each one still pays its own long prefill and KV-cache costs, so "batching" looks much less like reuse and more like parallel contention.
- When prompts are identical, reuse is easier and batching looks better; when they diverge, the runtime loses the chance to amortize the expensive early tokens.
- On a base M4 with 16GB unified memory, thermal limits and memory pressure can flatten gains before the theoretical compute-vs-bandwidth tradeoff matters.
- For inference-machine buying decisions, sustained memory bandwidth, KV-cache behavior, and long-context batching efficiency matter more than peak TFLOPS on paper.
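The amortization argument above can be sketched with a rough memory-traffic model: in a bandwidth-bound decode step, weights are read once and shared across the batch, while each request streams its own KV cache every step. The numbers below (a ~4 GB 4-bit model, ~128 KB of KV cache per token) are illustrative assumptions, not oMLX measurements.

```python
# Rough decode-step memory-traffic model (assumption: decode is
# bandwidth-bound, so bytes moved per token approximates cost).
def bytes_per_token(weight_bytes, kv_bytes_per_token, context, batch):
    # Weights are read once per batched step, amortized across requests;
    # each request must still stream its own KV cache for the full context.
    weight_share = weight_bytes / batch
    kv_traffic = kv_bytes_per_token * context
    return weight_share + kv_traffic

W = 4e9        # hypothetical 8B-class model quantized to ~4 GB
KV = 131072    # hypothetical ~128 KB of KV cache per token

# Per-token traffic with batch=8 relative to batch=1, short vs. long context.
short = bytes_per_token(W, KV, 1024, 8) / bytes_per_token(W, KV, 1024, 1)
long_ = bytes_per_token(W, KV, 32768, 8) / bytes_per_token(W, KV, 32768, 1)
print(f"batch=8 traffic vs batch=1: 1k ctx -> {short:.2f}, 32k ctx -> {long_:.2f}")
```

Under these assumptions, batching cuts per-token traffic to roughly 15% of the unbatched cost at 1k context but only to roughly 58% at 32k, because per-request KV-cache reads dominate and cannot be amortized, which matches the shape of the reported results.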
// TAGS
benchmark · inference · llm · self-hosted · omlx
DISCOVERED
3h ago
2026-04-17
PUBLISHED
5h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
Seetie_AI