OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
DeepSeek-V4 slashes KV cache usage with CSA, HCA
DeepSeek-V4 introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to reduce KV cache requirements by over 90%. These architectural breakthroughs enable 1-million-token context windows on consumer and workstation hardware, effectively neutralizing the memory advantages of competing transformer-SSM hybrid models.
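A minimal sketch of what the CSA/HCA split could look like for a single query vector and one head. The block sizes match the quoted 4x/128x ratios, but the mean-pooling scheme, top-k block retrieval, and all function names are illustrative assumptions, not DeepSeek's published design.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def pool_blocks(m, block):
    """Mean-pool a (T, d) KV matrix into (T // block, d) block summaries."""
    n = m.shape[0] // block
    return m[: n * block].reshape(n, block, -1).mean(axis=1)

def csa_attention(q, k, v, block=4, top_k=64):
    """Compressed Sparse Attention (toy): 4x-pooled cache plus sparse
    retrieval -- the query attends only to its top_k best-scoring blocks."""
    k_blk, v_blk = pool_blocks(k, block), pool_blocks(v, block)
    scores = k_blk @ q / np.sqrt(q.size)
    top = np.argsort(scores)[-top_k:]          # indices of retrieved blocks
    return softmax(scores[top]) @ v_blk[top]

def hca_attention(q, k, v, block=128):
    """Heavily Compressed Attention (toy): a single 128x-pooled global
    view; the cache keeps only T // 128 summaries of the whole context."""
    k_blk, v_blk = pool_blocks(k, block), pool_blocks(v, block)
    return softmax(k_blk @ q / np.sqrt(q.size)) @ v_blk

rng = np.random.default_rng(0)
T, d = 4096, 64                                # toy context length, head dim
q = rng.standard_normal(d)
k, v = rng.standard_normal((T, d)), rng.standard_normal((T, d))
print(csa_attention(q, k, v).shape, hca_attention(q, k, v).shape)  # (64,) (64,)
```

Interleaving the two layer types is what buys the savings: CSA layers keep fine-grained 4x-pooled state for precise retrieval, while HCA layers carry only a 128x-pooled global summary.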
// ANALYSIS
DeepSeek-V4's interleaved attention layers represent a massive leap in long-context efficiency, moving beyond Multi-head Latent Attention (MLA) to near-constant memory overhead.
- Compressed Sparse Attention (CSA) provides fine-grained reasoning via 4x KV compression and top-k retrieval, while Heavily Compressed Attention (HCA) offers global context through 128x compression.
- Detailed calculations confirm a 7.9x to 11.3x reduction in KV cache storage compared to DeepSeek-V3.2, with the 1.6T Pro model requiring only 8.7GiB for a 1M-token context (a worked sketch of this arithmetic follows the list).
- This architecture allows the massive Pro model to run million-token contexts on 1.5TB RAM setups, while the Flash model remains viable on standard 256GB workstations.
- By matching the memory footprint of SSMs within a transformer framework, DeepSeek has set a new efficiency benchmark that context-heavy players like Kimi and Zhipu are likely to adopt.
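A back-of-envelope version of that cache arithmetic. Every dimension below is a hypothetical placeholder (the depth, per-token KV width, bf16 cache, and 1:1 CSA:HCA interleave are assumptions; V4's real configuration has not been published), so the output illustrates the shape of the calculation rather than reproducing the exact 8.7GiB figure.

```python
def kv_cache_gib(tokens, layers, kv_dim, pool=1, bytes_per_elem=2):
    """KV cache size in GiB: keys + values for `layers` layers, with the
    cached sequence shortened `pool`-fold by block pooling."""
    return 2 * layers * (tokens // pool) * kv_dim * bytes_per_elem / 2**30

tokens = 1_000_000                  # 1M-token context
layers, kv_dim = 61, 576            # hypothetical depth and per-token KV width

dense = kv_cache_gib(tokens, layers, kv_dim)
# Assume a 1:1 interleave: ~half the layers CSA (4x), half HCA (128x).
hybrid = (kv_cache_gib(tokens, 31, kv_dim, pool=4)
          + kv_cache_gib(tokens, 30, kv_dim, pool=128))

print(f"uncompressed baseline: {dense:7.1f} GiB")
print(f"CSA/HCA hybrid:        {hybrid:7.1f} GiB "
      f"({dense / hybrid:.1f}x smaller, {100 * (1 - hybrid / dense):.0f}% saved)")
```

With these placeholder numbers the hybrid cache lands around 17 GiB (roughly 7.6x smaller than the uncompressed baseline). The footprint is dominated by the CSA layers, since HCA's cache grows 128x more slowly than the raw sequence; pushing more of the stack into HCA is what would move the savings toward the 11.3x end of the quoted range.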
// TAGS
deepseek-v4 · llm · attention · inference · research · mlops
DISCOVERED
3h ago
2026-04-26
PUBLISHED
5h ago
2026-04-26
RELEVANCE
10/10
AUTHOR
Ok_Warning2146