Asymmetric KV cache causes Qwen 3.6 slowdowns
OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE


A newly identified performance bug in llama.cpp causes Qwen 3.6 27B to plummet from 40 to 8 tokens per second in multi-turn conversations. The slowdown is triggered by asymmetric quantization of the K and V caches, specifically when the Walsh-Hadamard Rotation feature is in use.

// ANALYSIS

This is a classic local LLM trap where users try to optimize VRAM by mixing cache quantization types, only to accidentally trigger a massive performance regression.

  • Recent versions of llama.cpp introduced Walsh-Hadamard Rotation to improve quantized KV cache quality
  • Setting different quantization types for K and V (e.g., q8_0 for K and q4_0 for V) breaks matrix alignment between the two caches, causing extreme slowdowns
  • The fix is simple: ensure K and V caches use identical quantization types (e.g., both q8_0 or both q4_0)
  • The issue is further compounded by a bug in CUDA 13.2 that corrupts KV memory with this model, requiring a downgrade to CUDA 13.1
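The fix described above can be sketched as a llama.cpp server invocation that keeps the K and V cache quantization symmetric. The `--cache-type-k`/`--cache-type-v` flags are llama.cpp's standard cache-quantization options; the model filename and `-ngl` value are illustrative placeholders, not taken from the original post.

```shell
# Symmetric KV cache quantization: both caches set to q8_0 so the
# Walsh-Hadamard Rotation path stays aligned (avoids the reported
# 40 -> 8 tok/s regression). Model path and layer count are examples.
llama-server \
  -m ./qwen3.6-27b-q4_k_m.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99
```

If VRAM is tight, dropping both caches to q4_0 together is the symmetric alternative; the trap is mixing types, not choosing a lower precision.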
// TAGS
llama-cpp, qwen, inference, llm, local-llm, gpu, vram

DISCOVERED

3h ago

2026-04-25

PUBLISHED

5h ago

2026-04-25

RELEVANCE

8 / 10

AUTHOR

gigachad_deluxe