TurboQuant cuts KV cache 3x on Apple Silicon
This post shares real-world Apple Silicon benchmarks for TurboQuant, a KV-cache compression approach implemented in a community llama.cpp fork with Metal support. On a Mac mini M4 16GB, KV cache dropped from 1280 MiB to 465 MiB on Qwen3-14B at 8K context, with throughput moving from 9.95 t/s to 9.25 t/s. On an M3 Max 48GB running Qwen3.5 35B at 128K context, KV cache fell from 2560 MiB to 930 MiB, while throughput stayed relatively close at 45.34 t/s versus 42.88 t/s. The main takeaway is that TurboQuant is less about raw speed gains and more about freeing substantial memory headroom for longer contexts and more concurrent workloads.
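The reported numbers line up with standard KV-cache arithmetic. As a sanity check, here is a back-of-envelope sizing sketch; the formula is the usual transformer KV-cache estimate, and the model shape values (40 layers, 8 KV heads, head dim 128) are assumptions for illustration, not taken from the post:

```python
# Rough KV-cache sizing: K and V tensors for every layer and position.
# Model-shape numbers below are assumed for illustration, not measured.

def kv_cache_mib(layers, kv_heads, head_dim, context, bytes_per_elem):
    """Total KV cache in MiB (factor of 2 covers both K and V)."""
    total_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_elem
    return total_bytes / (1024 ** 2)

# fp16 baseline vs a ~2.75x-compressed mixed scheme (roughly what an
# 8-bit-K / ~3-bit-V split averages out to).
fp16 = kv_cache_mib(layers=40, kv_heads=8, head_dim=128,
                    context=8192, bytes_per_elem=2)
mixed = kv_cache_mib(layers=40, kv_heads=8, head_dim=128,
                     context=8192, bytes_per_elem=2 / 2.75)
print(f"fp16: {fp16:.0f} MiB, compressed: {mixed:.0f} MiB, "
      f"ratio: {fp16 / mixed:.2f}x")
```

Under these assumed shapes the fp16 baseline works out to 1280 MiB at 8K context, matching the Qwen3-14B figure in the post, and a ~2.75x reduction lands near the reported 465 MiB.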
Hot take: this is a memory-efficiency story, not a speed story, and that is exactly why it matters on Apple Silicon.
- The benchmark shows roughly 3x KV-cache compression in both cases, which is the meaningful win here.
- Throughput only drops modestly, so the tradeoff looks practical for long-context local inference.
- The 128K-context result is more important operationally than the smaller test, because it expands headroom for multi-agent or multi-process workloads.
- This is a community fork/implementation benchmark, not a mainline llama.cpp release, so reproducibility depends on that fork and its Metal kernels.
- The asymmetric `q8_0` K / `turbo3` V setup is the key technical idea: preserve attention routing more carefully while compressing values more aggressively.
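The asymmetric idea can be sketched with generic blockwise absmax quantization: keep keys at 8 bits so attention scores (and thus routing) stay accurate, while squeezing values down to ~3 bits. This is not the actual `turbo3` kernel, whose format the post does not describe; the block size of 32 mirrors llama.cpp's `q8_0` layout, but everything else here is an assumption:

```python
# Sketch of asymmetric KV quantization: 8-bit keys, 3-bit values.
# Generic blockwise symmetric absmax quant, NOT the real turbo3 format.
import random

def quantize_dequantize(x, bits, block=32):
    """Blockwise symmetric absmax quantize -> dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 3 for 3-bit
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = max(abs(v) for v in blk) / qmax or 1.0
        out.extend(round(v / scale) * scale for v in blk)
    return out

def rmse(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
k = [random.gauss(0, 1) for _ in range(256)]
v = [random.gauss(0, 1) for _ in range(256)]

err_k = rmse(k, quantize_dequantize(k, bits=8))  # keys kept precise
err_v = rmse(v, quantize_dequantize(v, bits=3))  # values hit harder
print(f"K rmse: {err_k:.4f}, V rmse: {err_v:.4f}")
```

Running this shows key reconstruction error staying near zero while value error is an order of magnitude larger, which is the tradeoff the scheme accepts: values only get mixed into the output after attention weights are computed, so they tolerate coarser quantization than the keys that decide where attention goes.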
Discovered: 2026-04-06 · Published: 2026-04-06 · Author: Expensive-String8854