Kwai-Keye drops 30B multimodal MoE with DSA attention
Kuaishou's Keye team released Keye-VL-2.0-30B-A3B, a 30B-parameter multimodal MoE that integrates DeepSeek Sparse Attention (DSA). The architecture is designed to curb the growth of KV cache cost, enabling 256K-token context windows aimed at multi-hour video analysis on consumer hardware.
Bringing DeepSeek Sparse Attention into a multimodal architecture targets the memory blow-up that traditionally makes long-video reasoning prohibitively expensive.
- DSA limits how many cached tokens each query attends to, which is meant to tame the KV-cache-driven cost growth that normally plagues long-context vision models
- The MoE architecture activates only about 3B of its 30B parameters per token, so per-token compute is closer to that of a 3B dense model than a 30B one
- Early benchmarks suggest it matches Gemini 1.5 Flash on temporal grounding and outperforms larger open-weight models like Qwen3-VL-235B
- The model introduces the first agent capabilities in the Keye series, supporting visual self-correction and tool use
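
To make the sparse-attention point concrete, here is a minimal sketch of top-k token selection in plain Python. It is an illustration under stated assumptions, not Keye's or DeepSeek's implementation: `cheap_index_scores` is a hypothetical stand-in for a learned indexer, and `top_k`, `index_dim`, and the toy dimensions are made up for the example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cheap_index_scores(query, keys, index_dim):
    """Hypothetical indexer: scores every cached key using only the
    first index_dim dimensions, so it is cheaper than full attention."""
    return [dot(query[:index_dim], key[:index_dim]) for key in keys]


def sparse_attention(query, keys, values, index_scores, top_k):
    """Softmax attention restricted to the top_k highest-scoring positions."""
    k = min(top_k, len(keys))
    # Rank cached positions by indexer score and keep the best k.
    selected = sorted(
        range(len(keys)), key=lambda i: index_scores[i], reverse=True
    )[:k]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return output, selected


# Toy demo: the attended set stays at top_k while the context grows.
random.seed(0)
dim, top_k = 8, 4
for context_len in (16, 256):
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(context_len)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(context_len)]
    query = [random.gauss(0, 1) for _ in range(dim)]
    scores = cheap_index_scores(query, keys, index_dim=2)
    _, selected = sparse_attention(query, keys, values, scores, top_k)
    print(context_len, len(selected))
```

In this sketch the indexer still scans every cached position and the cache still holds all of them; what stays fixed is the number of positions the full softmax attention touches per query.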
Discovered: 2026-05-26
Published: 2026-05-26
Author: External_Mood4719