TurboQuant cuts KV cache 3x on Apple Silicon
OPEN_SOURCE
REDDIT · 6d ago · BENCHMARK RESULT


This post shares real-world Apple Silicon benchmarks for TurboQuant, a KV-cache compression approach implemented in a community llama.cpp fork with Metal support. On a Mac mini M4 16GB, KV cache dropped from 1280 MiB to 465 MiB on Qwen3-14B at 8K context, with throughput moving from 9.95 t/s to 9.25 t/s. On an M3 Max 48GB running Qwen3.5 35B at 128K context, KV cache fell from 2560 MiB to 930 MiB, while throughput stayed relatively close at 45.34 t/s versus 42.88 t/s. The main takeaway is that TurboQuant is less about raw speed gains and more about freeing substantial memory headroom for longer contexts and more concurrent workloads.
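The reported numbers are consistent with straightforward KV-cache arithmetic. A minimal sketch, assuming Qwen3-14B's published geometry (40 layers, 8 KV heads under GQA, head dim 128), llama.cpp's q8_0 layout (8.5 bits per element: 32 values plus an fp16 scale per block), and an effective 3.125 bits per element for `turbo3` — the last figure is an assumption chosen because it reproduces the post's numbers, not a documented property of the fork:

```python
# Back-of-envelope KV-cache sizing for the Qwen3-14B / 8K-context result.
# Model shape is assumed from Qwen3-14B's published config; 8.5 bits/elem
# matches llama.cpp's q8_0 block layout, and 3.125 bits for `turbo3` is
# an assumption that happens to reproduce the reported 465 MiB.

def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, k_bits, v_bits):
    """Total K+V cache size in MiB at the given per-element bit widths."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # per K (and per V)
    total_bits = elems * (k_bits + v_bits)
    return total_bits / 8 / (1024 ** 2)

fp16  = kv_cache_mib(8192, 40, 8, 128, k_bits=16,  v_bits=16)
mixed = kv_cache_mib(8192, 40, 8, 128, k_bits=8.5, v_bits=3.125)

print(f"fp16 KV cache:      {fp16:.0f} MiB")   # 1280 MiB, matching the post
print(f"q8_0 K / turbo3 V:  {mixed:.0f} MiB")  # 465 MiB, ~2.75x smaller
```

Under these assumptions the fp16 baseline works out to exactly the 1280 MiB the post reports, which suggests the benchmark's "3x" is really about 2.75x once q8_0's scale overhead on the K side is counted.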

// ANALYSIS

Hot take: this is a memory-efficiency story, not a speed story, and that is exactly why it matters on Apple Silicon.

  • The benchmark shows roughly 3x KV-cache compression in both cases, which is the meaningful win here.
  • Throughput only drops modestly, so the tradeoff looks practical for long-context local inference.
  • The 128K-context result is more important operationally than the smaller test, because it expands headroom for multi-agent or multi-process workloads.
  • This is a community fork/implementation benchmark, not a mainline llama.cpp release, so reproducibility depends on that fork and its Metal kernels.
  • The asymmetric `q8_0` K / `turbo3` V setup is the key technical idea: keep keys, which drive attention routing, at higher precision while compressing the values more aggressively.
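For context, mainline llama.cpp already exposes independent K and V cache types via `--cache-type-k` / `--cache-type-v` (`-ctk` / `-ctv`); the fork presumably hangs `turbo3` off the same mechanism. A sketch of what the invocation would look like — the `turbo3` value and the model filename are assumptions, since that type does not exist upstream:

```shell
# Mainline llama.cpp flags for per-tensor KV cache quantization.
# `turbo3` is assumed to be the extra V-cache type added by the
# community fork; upstream builds will reject it.
./llama-server \
  -m qwen3-14b.gguf \
  -c 8192 \
  --cache-type-k q8_0 \
  --cache-type-v turbo3
```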
// TAGS
turboquant · kv-cache · llamacpp · apple-silicon · metal · benchmark · qwen · quantization

DISCOVERED

2026-04-06 (6d ago)

PUBLISHED

2026-04-06 (6d ago)

RELEVANCE

10/10

AUTHOR

Expensive-String8854