DeepSeek-V4-Flash MTP quant tops 85 tok/s
This Hugging Face checkpoint restores DeepSeek-V4-Flash’s stripped MTP head, requantizes the routed experts, and makes self-speculative decoding work in a patched vLLM build. On 2× RTX PRO 6000 Max-Q cards, it reports 85.52 tok/s at 524k context versus 52.85 tok/s for the no-MTP baseline.
This is less a new model architecture than a serving-focused retrofit, but the throughput gain is real and the writeup is unusually transparent about the plumbing.
- The main fix is restoring the MTP block that transformers silently drops, so `--speculative-config` stops being a no-op.
- The reported speedup is stack-specific: TP=2, a patched vLLM build, Max-Q workstation GPUs, and `--disable-custom-all-reduce` are all required (see the launch sketch after this list).
- The GPTQ pass on the MTP routed experts appears to be a smaller but meaningful increment on top of the MTP self-speculation gain (a quantization sketch follows the launch example).
- The release is most useful for people serving very long-context DeepSeek variants locally, not as a broad benchmark signal for the whole model class.
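For concreteness, here is a minimal launch sketch using vLLM's Python API, assuming the patched build the post describes. The repo id is a placeholder, and the `speculative_config` keys (`method`, `num_speculative_tokens`) are assumptions based on how vLLM exposes DeepSeek-style MTP drafting; only TP=2 and the disabled custom all-reduce come from the post itself.

```python
# Hedged sketch of the serving setup described above; requires the
# patched vLLM build the post refers to. The repo id and the
# speculative_config values are assumptions, not taken from the release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/deepseek-v4-flash-mtp-gptq",  # hypothetical repo id
    tensor_parallel_size=2,          # TP=2 across the two Max-Q cards
    disable_custom_all_reduce=True,  # Python-API twin of --disable-custom-all-reduce
    speculative_config={             # what --speculative-config would carry
        "method": "deepseek_mtp",        # assumption: MTP self-speculation method name
        "num_speculative_tokens": 1,     # assumption: one draft token per step
    },
)

out = llm.generate(["<long-context prompt>"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```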
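And a hedged sketch of what a routed-expert-only GPTQ pass could look like with llm-compressor. The target regex, scheme, and calibration settings are assumptions for illustration, not the author's actual recipe.

```python
# Hedged sketch: GPTQ-requantize only the routed-expert projections,
# leaving attention, shared experts, and the MTP head untouched.
# Module-name pattern, scheme, and calibration settings are assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets=r"re:.*mlp\.experts\..*",  # assumption: routed-expert linears only
    scheme="W4A16",                    # assumption: 4-bit weights, 16-bit activations
    ignore=["lm_head"],                # keep the output head in full precision
)

oneshot(
    model="your-org/deepseek-v4-flash-mtp",  # hypothetical source checkpoint
    dataset="open_platypus",                 # assumption: generic calibration set
    recipe=recipe,
    output_dir="deepseek-v4-flash-mtp-gptq",
    max_seq_length=2048,
    num_calibration_samples=256,
)
```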
DISCOVERED: 2026-05-10 (3h ago)
PUBLISHED: 2026-05-10 (5h ago)
AUTHOR: Blahblahblakha