DeepSeek-V4-Flash MTP quant tops 85 tok/s

// 3h ago · MODEL RELEASE

This Hugging Face checkpoint restores DeepSeek-V4-Flash’s stripped MTP head, requantizes the routed experts, and makes self-speculative decoding work in patched vLLM. On 2× RTX PRO 6000 Max-Q cards, it reports 85.52 tok/s at 524k context versus 52.85 tok/s for the no-MTP baseline.
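Before wiring any of this up, it is worth confirming that the restored MTP head actually ships in the checkpoint. Here is a minimal sketch using `huggingface_hub`; the repo id and the assumption that the restored tensors carry "mtp" in their names are placeholders, not details taken from the release:

```python
# Hypothetical sanity check that the checkpoint contains the restored MTP head.
# The repo id and the "mtp" substring match are assumptions, not from the release notes.
import json

from huggingface_hub import hf_hub_download

REPO_ID = "Blahblahblakha/DeepSeek-V4-Flash-MTP-W4A16-FP8"  # placeholder repo id

# The safetensors index maps every parameter name to the shard file that stores it.
index_path = hf_hub_download(REPO_ID, "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# If the MTP block was restored, some parameter names should reference it.
mtp_tensors = sorted(name for name in weight_map if "mtp" in name.lower())
print(f"{len(mtp_tensors)} MTP-related tensors found")
for name in mtp_tensors[:10]:
    print(" ", name, "->", weight_map[name])
```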

// ANALYSIS

This is less a new model architecture than a serving-focused retrofit, but the throughput gain is real and the writeup is unusually transparent about the plumbing.

  • The main fix is restoring the MTP block that transformers silently drops, so `--speculative-config` stops being a no-op.
  • The reported speedup is stack-specific: TP=2, patched vLLM, Max-Q workstation GPUs, and `--disable-custom-all-reduce` are all required (see the launch sketch after this list).
  • The GPTQ pass on the MTP routed experts appears to be a smaller but meaningful increment on top of the MTP self-speculation gain.
  • The release is most useful for people serving very long-context DeepSeek variants locally, not as a broad benchmark signal for the whole model class.
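For anyone trying to reproduce the setup locally, the launch reduces to the flags listed above. Below is a minimal Python sketch, assuming a patched vLLM build whose `speculative_config` accepts an MTP self-speculation method; the repo id, the `deepseek_mtp` method name, and the config keys are placeholders rather than values confirmed by the release:

```python
# Minimal sketch of the serving setup described above: TP=2, MTP self-speculation,
# custom all-reduce disabled. Assumes a patched vLLM build; the repo id, the
# "deepseek_mtp" method name, and the speculative-config keys are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Blahblahblakha/DeepSeek-V4-Flash-MTP-W4A16-FP8",  # placeholder repo id
    tensor_parallel_size=2,              # two RTX PRO 6000 Max-Q cards
    disable_custom_all_reduce=True,      # the writeup reports this flag is required
    speculative_config={
        "method": "deepseek_mtp",        # assumed name for the MTP self-speculation path
        "num_speculative_tokens": 1,     # the restored head drafts one token per step
    },
    max_model_len=524_288,               # the 524k context the numbers were measured at
)

outputs = llm.generate(
    ["Explain why self-speculative decoding can raise long-context throughput."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Whether the reported 85.52 tok/s carries over depends on the same stack choices the bullets call out; on different interconnects or without the all-reduce workaround the numbers are unlikely to match.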
// TAGS
deepseek-v4-flash-acti-mtp-w4a16-fp8, llm, open-weights, quantization, moe, long-context, reasoning, inference

DISCOVERED: 3h ago (2026-05-10)

PUBLISHED: 5h ago (2026-05-10)

RELEVANCE: 10/10

AUTHOR: Blahblahblakha