OPEN_SOURCE ↗
REDDIT // 10d ago · PRODUCT UPDATE
llama.cpp lands attn-rot quantization boost
Merged PR #21038 adds an attention-rotation trick that wraps Q/K/V in a Hadamard transform before attention, then rotates the output back. The goal is a low-risk way to reduce quantization damage in the KV cache, with author benchmarks showing q8_0 coming very close to F16.
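The mechanism can be illustrated with a minimal numpy sketch (not the llama.cpp implementation; `hadamard` and `quantize_q8` are toy stand-ins): because an orthonormal Hadamard rotation spreads outlier channels across all dimensions, a per-tensor quantizer needs a smaller scale, and rotating back after dequantization recovers the original values with less error.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])  # orthonormal: H @ H.T == I

def quantize_q8(x):
    # symmetric per-tensor 8-bit round-trip (toy stand-in for q8_0)
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
d = 64
H = hadamard(d)

# toy K cache with one outlier channel, which inflates the quantization scale
K = rng.standard_normal((16, d))
K[:, 3] *= 50.0

# baseline: quantize the cache directly
err_plain = np.abs(quantize_q8(K) - K).mean()

# rotated: store H-rotated K, quantize, rotate back before use
K_back = quantize_q8(K @ H) @ H.T
err_rot = np.abs(K_back - K).mean()

assert err_rot < err_plain  # spreading the outlier shrinks the error
```

Without quantization the round trip is exact (`H` is orthonormal), so the rotation only changes what the quantizer sees, which is why the PR can describe it as low-risk.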
// ANALYSIS
This is the kind of pragmatic infrastructure improvement that matters more than flashy algorithm names: a simple baseline that preserves quality while making cache quantization safer.
- The implementation is intentionally minimal and backend-agnostic, so it fits llama.cpp’s broad hardware story without introducing new tensor types
- Benchmarks in the PR show q8_0 essentially matching F16 on several models, while lower-bit caches also improve versus the pre-rotation baseline
- It is not full TurboQuant; the author explicitly notes missing pieces like PolarQuant and QJL, so this is a strong baseline rather than the end state
- For long-context users, the practical win is cheaper KV cache memory with less quality regression, which is the main pain point these tricks are trying to solve
- MLA is not supported yet, so the current upside is strongest for the model families llama.cpp already handles well
// TAGS
llm · inference · open-source · benchmark · llama-cpp
DISCOVERED
10d ago
2026-04-01
PUBLISHED
10d ago
2026-04-01
RELEVANCE
9/10
AUTHOR
Dany0