OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
DeepSeek-V4 slashes KV cache usage with CSA, HCA
DeepSeek-V4 introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to reduce KV cache requirements by over 90%. These architectural breakthroughs enable 1-million-token context windows on consumer and workstation hardware, effectively neutralizing the memory advantages of competing transformer-SSM hybrid models.
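A minimal sketch of what the CSA/HCA split could look like for a single query vector and one head. The block sizes match the quoted 4x/128x ratios, but the mean-pooling scheme, top-k block retrieval, and all function names are illustrative assumptions, not DeepSeek's published design.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def pool_blocks(m, block):
    """Mean-pool a (T, d) KV matrix into (T // block, d) block summaries."""
    n = m.shape[0] // block
    return m[: n * block].reshape(n, block, -1).mean(axis=1)

def csa_attention(q, k, v, block=4, top_k=64):
    """Compressed Sparse Attention (toy): 4x-pooled cache plus sparse
    retrieval -- the query attends only to its top_k best-scoring blocks."""
    k_blk, v_blk = pool_blocks(k, block), pool_blocks(v, block)
    scores = k_blk @ q / np.sqrt(q.size)
    top = np.argsort(scores)[-top_k:]          # indices of retrieved blocks
    return softmax(scores[top]) @ v_blk[top]

def hca_attention(q, k, v, block=128):
    """Heavily Compressed Attention (toy): a single 128x-pooled global
    view; the cache keeps only T // 128 summaries of the whole context."""
    k_blk, v_blk = pool_blocks(k, block), pool_blocks(v, block)
    return softmax(k_blk @ q / np.sqrt(q.size)) @ v_blk

rng = np.random.default_rng(0)
T, d = 4096, 64                                # toy context length, head dim
q = rng.standard_normal(d)
k, v = rng.standard_normal((T, d)), rng.standard_normal((T, d))
print(csa_attention(q, k, v).shape, hca_attention(q, k, v).shape)  # (64,) (64,)
```

Interleaving the two layer types is what buys the savings: CSA layers keep fine-grained 4x-pooled state for precise retrieval, while HCA layers carry only a 128x-pooled global summary.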
// ANALYSIS
DeepSeek-V4's interleaved attention layers represent a massive leap in long-context efficiency, moving beyond Multi-head Latent Attention (MLA) to near-constant memory overhead.
- Compressed Sparse Attention (CSA) provides fine-grained reasoning via 4x KV compression and top-k retrieval, while Heavily Compressed Attention (HCA) offers global context through 128x compression.
- Detailed calculations confirm a 7.9x to 11.3x reduction in KV cache storage compared to DeepSeek-V3.2, with the 1.6T Pro model requiring only 8.7GiB for a 1M-token context (a worked sketch of this arithmetic follows the list).
- This architecture allows the massive Pro model to run million-token contexts on 1.5TB RAM setups, while the Flash model remains viable on standard 256GB workstations.
- By matching the memory footprint of SSMs within a transformer framework, DeepSeek has set a new efficiency benchmark that context-heavy players like Kimi and Zhipu are likely to adopt.
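A back-of-envelope version of that cache arithmetic. Every dimension below is a hypothetical placeholder (the depth, per-token KV width, bf16 cache, and 1:1 CSA:HCA interleave are assumptions; V4's real configuration has not been published), so the output illustrates the shape of the calculation rather than reproducing the exact 8.7GiB figure.

```python
def kv_cache_gib(tokens, layers, kv_dim, pool=1, bytes_per_elem=2):
    """KV cache size in GiB: keys + values for `layers` layers, with the
    cached sequence shortened `pool`-fold by block pooling."""
    return 2 * layers * (tokens // pool) * kv_dim * bytes_per_elem / 2**30

tokens = 1_000_000                  # 1M-token context
layers, kv_dim = 61, 576            # hypothetical depth and per-token KV width

dense = kv_cache_gib(tokens, layers, kv_dim)
# Assume a 1:1 interleave: ~half the layers CSA (4x), half HCA (128x).
hybrid = (kv_cache_gib(tokens, 31, kv_dim, pool=4)
          + kv_cache_gib(tokens, 30, kv_dim, pool=128))

print(f"uncompressed baseline: {dense:7.1f} GiB")
print(f"CSA/HCA hybrid:        {hybrid:7.1f} GiB "
      f"({dense / hybrid:.1f}x smaller, {100 * (1 - hybrid / dense):.0f}% saved)")
```

With these placeholder numbers the hybrid cache lands around 17 GiB (roughly 7.6x smaller than the uncompressed baseline). The footprint is dominated by the CSA layers, since HCA's cache grows 128x more slowly than the raw sequence; pushing more of the stack into HCA is what would move the savings toward the 11.3x end of the quoted range.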
// TAGS
deepseek-v4 · llm · attention · inference · research · mlops
DISCOVERED
3h ago
2026-04-26
PUBLISHED
5h ago
2026-04-26
RELEVANCE
10/10
AUTHOR
Ok_Warning2146