DeepSeek-V4-Flash MTP quant tops 85 tok/s
This Hugging Face checkpoint restores DeepSeek-V4-Flash’s stripped MTP head, requantizes the routed experts, and makes self-speculative decoding work in a patched vLLM build. On 2× RTX PRO 6000 Max-Q cards, it reports 85.52 tok/s at 524k context versus 52.85 tok/s for the no-MTP baseline.
This is less a new model architecture than a serving-focused retrofit, but the throughput gain is real and the writeup is unusually transparent about the plumbing.
- The main fix is restoring the MTP block that transformers silently drops, so `--speculative-config` stops being a no-op.
- The reported speedup is stack-specific: TP=2, a patched vLLM build, Max-Q workstation GPUs, and `--disable-custom-all-reduce` are all required (see the launch sketch after this list).
- The GPTQ pass on the MTP routed experts appears to be a smaller but meaningful increment on top of the MTP self-speculation gain (a quantization sketch follows the launch example).
- The release is most useful for people serving very long-context DeepSeek variants locally, not as a broad benchmark signal for the whole model class.
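For concreteness, here is a minimal launch sketch using vLLM's Python API, assuming the patched build the post describes. The repo id is a placeholder, and the `speculative_config` keys (`method`, `num_speculative_tokens`) are assumptions based on how vLLM exposes DeepSeek-style MTP drafting; only TP=2 and the disabled custom all-reduce come from the post itself.

```python
# Hedged sketch of the serving setup described above; requires the
# patched vLLM build the post refers to. The repo id and the
# speculative_config values are assumptions, not taken from the release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/deepseek-v4-flash-mtp-gptq",  # hypothetical repo id
    tensor_parallel_size=2,          # TP=2 across the two Max-Q cards
    disable_custom_all_reduce=True,  # Python-API twin of --disable-custom-all-reduce
    speculative_config={             # what --speculative-config would carry
        "method": "deepseek_mtp",        # assumption: MTP self-speculation method name
        "num_speculative_tokens": 1,     # assumption: one draft token per step
    },
)

out = llm.generate(["<long-context prompt>"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```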
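And a hedged sketch of what a routed-expert-only GPTQ pass could look like with llm-compressor. The target regex, scheme, and calibration settings are assumptions for illustration, not the author's actual recipe.

```python
# Hedged sketch: GPTQ-requantize only the routed-expert projections,
# leaving attention, shared experts, and the MTP head untouched.
# Module-name pattern, scheme, and calibration settings are assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets=r"re:.*mlp\.experts\..*",  # assumption: routed-expert linears only
    scheme="W4A16",                    # assumption: 4-bit weights, 16-bit activations
    ignore=["lm_head"],                # keep the output head in full precision
)

oneshot(
    model="your-org/deepseek-v4-flash-mtp",  # hypothetical source checkpoint
    dataset="open_platypus",                 # assumption: generic calibration set
    recipe=recipe,
    output_dir="deepseek-v4-flash-mtp-gptq",
    max_seq_length=2048,
    num_calibration_samples=256,
)
```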
DISCOVERED: 2026-05-10 (3h ago)
PUBLISHED: 2026-05-10 (5h ago)
AUTHOR: Blahblahblakha