llama.cpp adds TurboQuant lite KV cache
llama.cpp integrates "attn-rot," a simplified TurboQuant implementation that enables high-quality 4-bit KV cache quantization. By using Hadamard transforms to redistribute outliers, the update allows for massive context windows with minimal reasoning loss on consumer hardware.
Hadamard rotation is something of a "holy grail" for local LLM efficiency: it addresses the reasoning breakdown usually seen with aggressive KV cache quantization. This merge effectively doubles context capacity on most consumer GPUs with little loss in output quality.
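To put rough numbers on the context-capacity claim, here is a back-of-the-envelope sketch of KV cache size per token. The model dimensions (32 layers, 8 KV heads, head size 128) and the 4 GiB budget are illustrative assumptions, not figures from the merge. The bits-per-element values reflect ggml's block formats, where `q4_0` and `q8_0` store 32 values plus one fp16 scale per block.

```python
# Effective bits per cached element, including the per-block fp16 scale
# (blocks of 32 values): q4_0 = 18 bytes / 32, q8_0 = 34 bytes / 32.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type):
    """Bytes of KV cache needed for one token (keys and values together)."""
    elements = 2 * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    return elements * BITS_PER_ELEMENT[cache_type] / 8


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, cache_type):
    """Largest token count whose KV cache fits in budget_bytes."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)
    return int(budget_bytes // per_token)


# Hypothetical model shape and memory budget, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 4 * 1024**3  # 4 GiB reserved for the KV cache

for cache_type in ("f16", "q8_0", "q4_0"):
    tokens = max_context(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, cache_type)
    print(f"{cache_type}: {tokens} tokens")
```

Under these assumptions a `q4_0` cache is about 3.6x smaller than `f16` and about 1.9x smaller than `q8_0`, so the gain depends on which cache type you are moving from.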
- Hadamard transforms redistribute the "energy" of outlier values across the whole vector, making it easier to quantize accurately.
- 4-bit KV caches previously caused models to "break down" in logic; this update brings them close to full-precision quality.
- The rotation adds a minor 2-12% performance hit, but fitting 2-4x more context into the same memory is a worthwhile trade-off for most developers.
- The implementation is backend-agnostic, providing immediate gains for CUDA, Metal, and CPU inference.
- It focuses on the "rotation" aspect of the TurboQuant paper to maintain speed while gaining precision.
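The outlier-redistribution idea in the points above can be illustrated with a small sketch. This is not the llama.cpp code: it uses a plain orthonormal fast Walsh-Hadamard transform and a simplified 4-bit quantizer with a single absmax scale for the whole vector (real block formats use one scale per small block), and the test vector with one large outlier is synthetic.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform its own inverse
    return [x * scale for x in out]


def quantize_4bit(vec):
    """Simplified symmetric 4-bit quantize/dequantize round trip (one scale)."""
    amax = max(abs(x) for x in vec)
    if amax == 0.0:
        return list(vec)
    scale = amax / 7.0
    return [max(-8, min(7, round(x / scale))) * scale for x in vec]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(128)]
vec[5] = 40.0  # synthetic outlier channel that dominates the quantization scale

# Quantize directly, versus rotate -> quantize -> rotate back.
plain = quantize_4bit(vec)
rotated = fwht(quantize_4bit(fwht(vec)))

err_plain = rms_error(vec, plain)
err_rotated = rms_error(vec, rotated)
print(f"RMS error without rotation: {err_plain:.3f}")
print(f"RMS error with rotation:    {err_rotated:.3f}")
```

Without rotation, the single outlier stretches the quantization scale so far that most ordinary values round to zero. After rotation, the outlier's magnitude is spread evenly over all 128 entries, the scale shrinks, and the round-trip error drops accordingly.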
DISCOVERED
70d ago
2026-04-01
PUBLISHED
70d ago
2026-03-31
AUTHOR
Dany0