
Asymmetric KV cache causes Qwen 3.6 slowdowns

// 45d ago · INFRASTRUCTURE


A newly identified performance problem in llama.cpp drops Qwen 3.6 27B from 40 to 8 tokens per second, a 5x slowdown, in multi-turn conversations. The trigger is asymmetric quantization of the K and V caches, specifically when the Walsh-Hadamard Rotation feature is in use.

// ANALYSIS

This is a classic local LLM trap where users try to optimize VRAM by mixing cache quantization types, only to accidentally trigger a massive performance regression.

  • Recent versions of llama.cpp introduced Walsh-Hadamard Rotation to improve quantized KV cache quality
  • Setting different quantization types for K and V (e.g., q8_0 for K and q4_0 for V) breaks matrix alignment, causing extreme slowdowns
  • The fix is simple: ensure K and V caches use identical quantization types (e.g., both q8_0 or both q4_0)
  • The issue is further compounded by a bug in CUDA 13.2 that corrupts KV memory with this model, requiring a downgrade to CUDA 13.1
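
The fix in the list above can be enforced with a small guard before launching the server. The following is a minimal sketch, not an official llama.cpp tool: the helper name `build_kv_cache_args` and the model path `model.gguf` are hypothetical, and it assumes the KV cache types are set through llama.cpp's `--cache-type-k` and `--cache-type-v` options.

```python
def build_kv_cache_args(cache_type_k: str, cache_type_v: str) -> list[str]:
    """Return KV cache flags for llama.cpp, refusing mismatched K/V types.

    Hypothetical helper: it only builds the argument list and does not
    inspect the llama.cpp build or the CUDA version.
    """
    if cache_type_k != cache_type_v:
        # Asymmetric K/V quantization is the reported slowdown trigger.
        raise ValueError(
            f"K cache type {cache_type_k!r} differs from V cache type "
            f"{cache_type_v!r}; use the same type for both"
        )
    return ["--cache-type-k", cache_type_k, "--cache-type-v", cache_type_v]


# Symmetric configuration: both caches use q8_0.
# "model.gguf" is a placeholder path, not a real file name from the report.
command = ["llama-server", "-m", "model.gguf", *build_kv_cache_args("q8_0", "q8_0")]
print(" ".join(command))
```

Calling `build_kv_cache_args("q8_0", "q4_0")` raises `ValueError`, so a mixed configuration fails at launch time rather than surfacing later as slow generation.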
// TAGS
llama-cpp, qwen, inference, llm, local-llm, gpu, vram

DISCOVERED

45d ago

2026-04-25

PUBLISHED

45d ago

2026-04-25

RELEVANCE

8/10

AUTHOR

gigachad_deluxe