llama.cpp tests DeepSeek V3.2 support
A draft PR adds proof-of-concept support for DeepSeek V3.2 Exp, V3.2, and V3.2 Speciale to llama.cpp using DeepSeek Sparse Attention (DSA). It targets the CPU and CUDA backends and includes testing quants, a dedicated chat template, and tuning notes for OOM-prone runs.
This is infrastructure work, not hype: the hard part is making a sparse-attention MoE model behave correctly inside a local inference stack that was not originally built for it.

- The PR adds the lightning indexer and the DSA path DeepSeek V3.2 needs, so it is about faithful model support rather than just loading weights.
- The testing quants are enormous, so this targets cluster-scale or very high-memory rigs.
- The dedicated Jinja template and the tokenizer-conversion caveat show the port touches model architecture, prompt formatting, and conversion tooling, not just runtime kernels.
- The CUDA OOM guidance around `ubatch` and `-fitt` suggests the branch is usable for testers but still rough around the edges.

Since it is still a draft PR, the main open question is correctness and maintainability upstream, not whether the branch is interesting.
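The PR's kernels are far more involved, but the shape of DSA is easy to state: a lightweight "lightning indexer" scores every cached token against the incoming query, and only the top-k survivors are handed to full attention. Below is a minimal C++ sketch of just that selection step, not the PR's actual code; the names (`select_sparse_context`, `top_k`) are illustrative, and a plain dot product stands in for the indexer's small learned scoring function.

```cpp
// Sketch of DSA-style top-k context selection. Assumptions: a plain
// dot-product score replaces the indexer's learned scoring; names are
// hypothetical and do not come from the llama.cpp PR.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Score every cached position in a small index space, then keep only the
// top_k highest-scoring positions for the full attention pass.
std::vector<size_t> select_sparse_context(
        const std::vector<float>& idx_query,              // [index_dim]
        const std::vector<std::vector<float>>& idx_keys,  // [n_ctx][index_dim]
        size_t top_k) {
    const size_t n_ctx = idx_keys.size();
    std::vector<float> score(n_ctx, 0.0f);
    for (size_t t = 0; t < n_ctx; ++t) {
        score[t] = std::inner_product(idx_query.begin(), idx_query.end(),
                                      idx_keys[t].begin(), 0.0f);
    }
    std::vector<size_t> order(n_ctx);
    std::iota(order.begin(), order.end(), 0);
    top_k = std::min(top_k, n_ctx);
    // Partial sort: only the k best positions need to be ordered.
    std::partial_sort(order.begin(), order.begin() + top_k, order.end(),
                      [&](size_t a, size_t b) { return score[a] > score[b]; });
    order.resize(top_k);
    return order; // positions the dense attention pass will actually read
}

int main() {
    const std::vector<std::vector<float>> keys = {{1, 0}, {0, 1}, {2, 2}, {-1, 0}};
    const std::vector<float> query = {1.0f, 0.5f};
    for (size_t pos : select_sparse_context(query, keys, 2)) {
        std::printf("attend to position %zu\n", pos); // prints positions 2 and 0
    }
    return 0;
}
```

The point of the indexer is that this scoring is cheap relative to attention itself, so long-context cost scales with the selected budget k rather than with the full square of the context length.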
DISCOVERED: 2026-05-06
PUBLISHED: 2026-05-06
AUTHOR: fairydreaming