OPEN_SOURCE
REDDIT // 3h ago · INFRASTRUCTURE
Asymmetric KV cache quantization causes Qwen 3.6 slowdowns
A newly identified performance killer in llama.cpp causes Qwen 3.6 27B to plummet from 40 to 8 tokens per second in multi-turn conversations. The issue is triggered by asymmetric quantization of the K and V caches, specifically when the Walsh-Hadamard Rotation feature is enabled.
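For background: a Walsh-Hadamard rotation is an orthonormal transform often applied before quantization to spread outlier channels evenly across dimensions, lowering per-block quantization error. The NumPy sketch below illustrates only the general idea, not llama.cpp's implementation, and the head dimension of 64 is just an example value.

```python
import numpy as np

def walsh_hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction).

    n must be a power of two.
    """
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# Rotating a key vector spreads outlier channels across all dimensions,
# which reduces per-block quantization error; the inverse rotation (H.T)
# recovers the original vector exactly after dequantization.
head_dim = 64  # example head dimension, not Qwen's actual value
k = np.random.randn(head_dim)
H = walsh_hadamard(head_dim)
k_rot = H @ k
assert np.allclose(H.T @ k_rot, k)
```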
// ANALYSIS
This is a classic local LLM trap where users try to optimize VRAM by mixing cache quantization types, only to accidentally trigger a massive performance regression.
- Recent versions of llama.cpp introduced Walsh-Hadamard Rotation to improve quantized KV cache quality
- Setting different quantization types for K and V (e.g., q8_0 for K and q4_0 for V) breaks the matrix alignment the rotation relies on, causing extreme slowdowns
- The fix is simple: ensure the K and V caches use identical quantization types (e.g., both q8_0 or both q4_0); see the sketch after this list
- The issue is compounded by a bug in CUDA 13.2 that corrupts KV memory with this model, requiring a downgrade to CUDA 13.1
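As a concrete illustration of the fix, the minimal sketch below uses the llama-cpp-python bindings, assuming a recent version that exposes the type_k/type_v parameters and the GGML type constants; the model path is hypothetical. The equivalent llama.cpp CLI flags are --cache-type-k and --cache-type-v, and note that a quantized V cache additionally requires flash attention.

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0

# Keep the K and V cache quantization types identical to stay on the fast path.
llm = Llama(
    model_path="./qwen3.6-27b.gguf",  # hypothetical path
    n_ctx=8192,
    flash_attn=True,                  # a quantized V cache requires flash attention
    type_k=GGML_TYPE_Q8_0,            # K cache: q8_0
    type_v=GGML_TYPE_Q8_0,            # V cache: q8_0, same as K
)
```

Matching both caches at q8_0 costs somewhat more VRAM than a q8_0/q4_0 mix, but per the report it avoids the slow path entirely.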
// TAGS
llama-cpp · qwen · inference · llm · local-llm · gpu · vram
DISCOVERED
2026-04-25 (3h ago)
PUBLISHED
2026-04-25 (5h ago)
RELEVANCE
8/10
AUTHOR
gigachad_deluxe