TurboQuant cuts KV cache 3x on Apple Silicon
This post shares real-world Apple Silicon benchmarks for TurboQuant, a KV-cache compression approach implemented in a community llama.cpp fork with Metal support. On a Mac mini M4 16GB, KV cache dropped from 1280 MiB to 465 MiB on Qwen3-14B at 8K context, with throughput moving from 9.95 t/s to 9.25 t/s. On an M3 Max 48GB running Qwen3.5 35B at 128K context, KV cache fell from 2560 MiB to 930 MiB, while throughput stayed relatively close at 45.34 t/s versus 42.88 t/s. The main takeaway is that TurboQuant is less about raw speed gains and more about freeing substantial memory headroom for longer contexts and more concurrent workloads.
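The reported numbers line up with standard KV-cache arithmetic. As a sanity check, here is a back-of-envelope sizing sketch; the formula is the usual transformer KV-cache estimate, and the model shape values (40 layers, 8 KV heads, head dim 128) are assumptions for illustration, not taken from the post:

```python
# Rough KV-cache sizing: K and V tensors for every layer and position.
# Model-shape numbers below are assumed for illustration, not measured.

def kv_cache_mib(layers, kv_heads, head_dim, context, bytes_per_elem):
    """Total KV cache in MiB (factor of 2 covers both K and V)."""
    total_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_elem
    return total_bytes / (1024 ** 2)

# fp16 baseline vs a ~2.75x-compressed mixed scheme (roughly what an
# 8-bit-K / ~3-bit-V split averages out to).
fp16 = kv_cache_mib(layers=40, kv_heads=8, head_dim=128,
                    context=8192, bytes_per_elem=2)
mixed = kv_cache_mib(layers=40, kv_heads=8, head_dim=128,
                     context=8192, bytes_per_elem=2 / 2.75)
print(f"fp16: {fp16:.0f} MiB, compressed: {mixed:.0f} MiB, "
      f"ratio: {fp16 / mixed:.2f}x")
```

Under these assumed shapes the fp16 baseline works out to 1280 MiB at 8K context, matching the Qwen3-14B figure in the post, and a ~2.75x reduction lands near the reported 465 MiB.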
Hot take: this is a memory-efficiency story, not a speed story, and that is exactly why it matters on Apple Silicon.
- The benchmark shows roughly 3x KV-cache compression in both cases, which is the meaningful win here.
- Throughput only drops modestly, so the tradeoff looks practical for long-context local inference.
- The 128K-context result is more important operationally than the smaller test, because it expands headroom for multi-agent or multi-process workloads.
- This is a community fork/implementation benchmark, not a mainline llama.cpp release, so reproducibility depends on that fork and its Metal kernels.
- The asymmetric `q8_0` K / `turbo3` V setup is the key technical idea: preserve attention routing more carefully while compressing values more aggressively.
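The asymmetric idea can be sketched with generic blockwise absmax quantization: keep keys at 8 bits so attention scores (and thus routing) stay accurate, while squeezing values down to ~3 bits. This is not the actual `turbo3` kernel, whose format the post does not describe; the block size of 32 mirrors llama.cpp's `q8_0` layout, but everything else here is an assumption:

```python
# Sketch of asymmetric KV quantization: 8-bit keys, 3-bit values.
# Generic blockwise symmetric absmax quant, NOT the real turbo3 format.
import random

def quantize_dequantize(x, bits, block=32):
    """Blockwise symmetric absmax quantize -> dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 3 for 3-bit
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = max(abs(v) for v in blk) / qmax or 1.0
        out.extend(round(v / scale) * scale for v in blk)
    return out

def rmse(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
k = [random.gauss(0, 1) for _ in range(256)]
v = [random.gauss(0, 1) for _ in range(256)]

err_k = rmse(k, quantize_dequantize(k, bits=8))  # keys kept precise
err_v = rmse(v, quantize_dequantize(v, bits=3))  # values hit harder
print(f"K rmse: {err_k:.4f}, V rmse: {err_v:.4f}")
```

Running this shows key reconstruction error staying near zero while value error is an order of magnitude larger, which is the tradeoff the scheme accepts: values only get mixed into the output after attention weights are computed, so they tolerate coarser quantization than the keys that decide where attention goes.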
Discovered: 2026-04-06 · Published: 2026-04-06 · Author: Expensive-String8854