OPEN_SOURCE
REDDIT · RESEARCH PAPER · 7h ago
Qwen 3.6 V-cache shrinks 3.5x via asymmetric quantization
A new asymmetric quantization technique for Qwen 3.6 reduces KV cache memory from 10.7GB to 6.9GB, enabling stable 1M-token context windows on a single GPU. By keeping Keys at high precision while aggressively quantizing Values to 2-bit or 3-bit, the method avoids the "softmax blowup" common in long-context models without discarding any sequence information.
// ANALYSIS
Treating K and V as fundamentally different data types is the key to unlocking million-token inference on consumer-grade hardware.
- Aggressive per-channel INT2/INT3 quantization on the V-cache leverages its robustness as a smooth attention-weighted mixture.
- High-precision K-cache preservation is critical to prevent RoPE-induced instability and repetitive outputs in long sequences.
- Unlike H2O or token eviction, this method retains every token, which is essential for "needle-in-a-haystack" tasks and complex reasoning.
- The success of this approach on Qwen 3.6 provides a scalable blueprint for optimizing other flagship models like Llama 3 or Mistral.
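The V-side of the idea can be sketched in a few lines. This is a minimal NumPy illustration of per-channel asymmetric low-bit quantization, not the paper's implementation: the function names, shapes, and the uniform round-to-nearest scheme are assumptions for demonstration. It shows why the V-cache tolerates 2-3 bits: attention output is a softmax-weighted mixture of V rows, so per-row rounding error largely averages out.

```python
import numpy as np

def quantize_v_per_channel(v, bits=3):
    """Asymmetric per-channel quantization of a V-cache slice.

    v: (seq_len, head_dim) float32; each channel (column) gets its
    own scale and zero-point (here, the channel minimum).
    """
    qmax = (1 << bits) - 1
    vmin = v.min(axis=0, keepdims=True)
    vmax = v.max(axis=0, keepdims=True)
    scale = (vmax - vmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    q = np.clip(np.round((v - vmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, vmin

def dequantize_v(q, scale, vmin):
    return q.astype(np.float32) * scale + vmin

rng = np.random.default_rng(0)
v = rng.standard_normal((1024, 128)).astype(np.float32)  # toy V-cache

q, scale, vmin = quantize_v_per_channel(v, bits=3)
v_hat = dequantize_v(q, scale, vmin)

# Attention reads V through a weighted average, so the quantization
# noise on individual rows mostly cancels in the mixed output.
w = rng.random((1, 1024))
w /= w.sum()  # stand-in attention weights (post-softmax)
err = np.abs(w @ v_hat - w @ v).max()
print(f"max attention-output error: {err:.4f}")
```

The same trick applied to K is what the paper avoids: RoPE-rotated Keys enter the softmax directly, so rounding noise there is amplified exponentially rather than averaged away.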
// TAGS
qwen-3-6 · llm · inference · research
DISCOVERED
7h ago
2026-04-19
PUBLISHED
9h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
ENIAC-85