llama.cpp hits 3-bit KV cache via TurboQuant
Google Research's TurboQuant algorithm has been integrated into llama.cpp, enabling roughly 6x KV cache compression down to 3.25 bits per value with near-lossless accuracy. The change allows 30B-parameter models such as Nemotron to reach 17 tokens/sec on consumer 8GB GPUs.
TurboQuant reduces quantization overhead by applying Walsh-Hadamard rotations to the key and value vectors, producing approximately Gaussian coordinate distributions that fixed Lloyd-Max quantizers can encode near-optimally. This lets 30B+ models fit comfortably in 8GB of VRAM alongside full 8k+ context windows, something previously achievable only with severe speed degradation. The algorithm combines two stages, PolarQuant and QJL, and preserves attention precision significantly better than standard 4-bit INT quantization while easing the memory-bandwidth bottleneck. The result is up to an 8x speedup in attention computation, and because the method is model-agnostic it applies directly to any Transformer architecture, including Llama, Mistral, and Gemma.
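The core rotate-then-quantize idea can be illustrated in a few lines. This is a minimal NumPy sketch, not llama.cpp's implementation: it substitutes a plain uniform scalar quantizer for TurboQuant's Lloyd-Max codebooks, and all function names here are hypothetical. It shows why the rotation helps: a single outlier that would dominate a direct quantizer's dynamic range gets spread evenly across all coordinates by the Hadamard transform, so a low-bit quantizer loses far less precision.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    The orthonormal Hadamard matrix is its own inverse, so applying fwht twice
    recovers the input."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_dequantize(v, bits):
    """Uniform min-max scalar quantizer (a stand-in for Lloyd-Max codebooks)."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((v - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
d = 256
v = rng.standard_normal(d)
v[0] = 100.0  # one outlier dominates the dynamic range, as KV activations often do

# Randomized rotation: random sign flip, then Hadamard transform. The rotated
# coordinates concentrate around a predictable Gaussian-like distribution.
signs = rng.choice([-1.0, 1.0], size=d)
rotated = fwht(v * signs)

# Quantize in the rotated space at 3 bits, then invert the rotation.
recon = fwht(quantize_dequantize(rotated, bits=3)) * signs

err_direct = np.linalg.norm(v - quantize_dequantize(v, bits=3)) / np.linalg.norm(v)
err_rotated = np.linalg.norm(v - recon) / np.linalg.norm(v)
print(f"relative error, direct 3-bit quantization: {err_direct:.3f}")
print(f"relative error, rotate-then-quantize:      {err_rotated:.3f}")
```

Because the rotation is orthonormal, it adds no distortion of its own; it only reshapes the distribution the quantizer sees, which is what makes a fixed low-bit codebook viable.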
DISCOVERED: 2026-04-01
PUBLISHED: 2026-03-31
AUTHOR: kvatrovit