OPEN_SOURCE
REDDIT · 16d ago · BENCHMARK RESULT
TurboQuant benchmarks show Metal slowdown
Google Research's TurboQuant claims 3-bit KV-cache compression with 6x+ memory savings and no accuracy loss, and llama.cpp contributors are already prototyping it. The early benchmark story is promising on memory, but Apple Silicon and CUDA performance still look very implementation-dependent.
// ANALYSIS
This looks like a real context-window breakthrough, but the current numbers read more like immature kernels than a flawed algorithm.
- Google’s blog says TurboQuant can cut KV-cache memory by at least 6x on long-context benchmarks while preserving quality on Llama-3.1-8B-Instruct.
- llama.cpp already has CPU, Metal, and CUDA experiments, which is a strong sign the method is portable across local-inference stacks.
- The Metal slowdown is plausible as an implementation issue: one contributor notes the current rotation path is still unoptimized, and Metal JIT can silently fall back to CPU if the shader setup is wrong.
- The CUDA path still needs correctness work; one tester reported garbage outputs even when the KV savings matched, which is a bigger blocker than raw speed.
- For local-model users, the real win is practical: more usable context on 8-16GB VRAM or RAM-constrained machines, not the death of RAG.
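As a sanity check on the headline memory claim, the KV-cache arithmetic is easy to sketch. The model shape below matches Llama-3.1-8B's published config (32 layers, 8 KV heads under GQA, head_dim 128), used here as an illustrative assumption; note the raw 16-bit-to-3-bit ratio alone is about 5.3x, so the blog's "6x+" presumably counts savings beyond pure bit width.

```python
# Back-of-envelope KV-cache sizing: fp16 vs 3-bit quantized values.
# Shape constants are Llama-3.1-8B's published config (illustrative
# assumption): 32 layers, 8 KV heads (GQA), head_dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_bytes(seq_len: int, bits_per_value: float) -> float:
    """Bytes needed to store K and V for seq_len tokens at the given precision."""
    n_values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len  # K + V tensors
    return n_values * bits_per_value / 8

ctx = 131_072  # a 128k-token context window
fp16 = kv_cache_bytes(ctx, 16)
q3 = kv_cache_bytes(ctx, 3)
print(f"fp16 : {fp16 / 2**30:.1f} GiB")          # → 16.0 GiB
print(f"3-bit: {q3 / 2**30:.1f} GiB, {fp16 / q3:.1f}x smaller")  # → 3.0 GiB, 5.3x
```

At fp16 the cache alone fills a 16GB machine at 128k tokens, which is why the compression matters more for local users than raw speed does.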
// TAGS
turboquant · llama-cpp · llm · benchmark · inference · open-source · gpu
DISCOVERED
16d ago
2026-03-26
PUBLISHED
16d ago
2026-03-26
RELEVANCE
9/10
AUTHOR
tcarambat