Llama.cpp mixed KV cache precision hurts performance
Benchmarks on AMD hardware show that mixing precision between the Key and Value caches (e.g., f16 K with q8_0 V) cuts prompt processing throughput by roughly 3x, from about 952 t/s to about 334 t/s on a Radeon 6950XT. Keeping both caches at the same type appears essential for GPU kernel efficiency in local LLM inference, since mismatched memory layouts prevent the use of optimized symmetric kernels.
Clever precision mixing is a silent performance killer that breaks GPU kernel optimization paths.
- Mismatched memory layouts prevent llama.cpp from using optimized, symmetric kernels on backends like Vulkan.
- Prompt processing throughput dropped from ~952 t/s to ~334 t/s (about 2.85x slower) on a Radeon 6950XT when mixing types.
- Token generation also degrades by ~15%, suggesting the penalty is not confined to prompt processing.
- The performance loss is not due to bandwidth, as uniform f16 performs nearly identically to uniform q8_0.
- Developers should set the K and V cache types (-ctk and -ctv flags) to the same value to stay on the accelerated kernel path.
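The comparison behind these numbers can be sketched in a few lines of Python. This is a minimal illustration, not the author's benchmark script: the helper names `slowdown` and `bench_command` and the model path `model.gguf` are hypothetical, and the only llama.cpp options used are `-m`, `-ctk`, and `-ctv`. Depending on the build, a quantized V cache may also require flash attention to be enabled; that option is left out here because its syntax varies between versions.

```python
import shlex

# Prompt processing throughput quoted in the summary above, in tokens/second.
UNIFORM_PP_TPS = 952.0
MIXED_PP_TPS = 334.0


def slowdown(baseline_tps: float, observed_tps: float) -> float:
    """Return how many times slower the observed run is than the baseline."""
    return baseline_tps / observed_tps


def bench_command(model_path: str, cache_type_k: str, cache_type_v: str) -> list[str]:
    """Build a llama-bench argument list for one K/V cache type pairing.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-ctk", cache_type_k,  # K cache type
        "-ctv", cache_type_v,  # V cache type
    ]


if __name__ == "__main__":
    factor = slowdown(UNIFORM_PP_TPS, MIXED_PP_TPS)
    print(f"prompt processing slowdown with mixed types: {factor:.2f}x")

    # Two uniform pairings and the mixed pairing described above.
    # "model.gguf" is a placeholder path.
    for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("f16", "q8_0")]:
        print(shlex.join(bench_command("model.gguf", k_type, v_type)))
```

Running the three printed commands on the same model and comparing the reported t/s is one way to check whether a given backend shows the same mixed-type penalty.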
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
AUTHOR
L3tum