DFlash speeds lossless speculative decoding
OPEN_SOURCE
REDDIT // 4d ago · RESEARCH PAPER

DFlash is a research project from Z Lab that uses a lightweight block diffusion model as the drafter in speculative decoding. By conditioning the draft model on target-model features and generating token blocks in parallel, it reports up to a 6x lossless end-to-end speedup and roughly 2.5x the speedup of EAGLE-3 on Qwen3-8B. The project ships a paper, a GitHub repo, and a Hugging Face model collection, with SGLang support for serving.

// ANALYSIS

This is a systems-first paper that makes diffusion useful by narrowing its job: not to replace the base LLM, but to draft blocks quickly while the verifier preserves exactness.

  • The key idea is practical: parallel block drafting matters more than chasing standalone generation quality.
  • Conditioning the drafter on target-model hidden features is the real unlock; it raises acceptance length without making the drafter huge.
  • The reported gains are strong for an inference optimization paper, especially because they stay lossless.
  • If the SGLang path is stable, this has a clearer route to real deployments than many speculative-decoding experiments.
  • The main question is generality: how much of the speedup survives across more models, longer contexts, and production traffic patterns.
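The "lossless" guarantee in the analysis above comes from the verify step, not the drafter. A minimal sketch of block-drafted speculative decoding with greedy verification, using hypothetical toy stand-ins for both models (`draft_block` and `target_next` are illustrative functions, not DFlash's API; the real drafter is a block diffusion model conditioned on target hidden features):

```python
# Toy sketch of lossless speculative decoding with a block drafter.
# Tokens are small integers; both "models" below are deterministic stand-ins.

def draft_block(prefix, k):
    # Hypothetical drafter: proposes k tokens for the prefix in one parallel step.
    return [(prefix[-1] + i + 1) % 50 for i in range(k)]

def target_next(prefix):
    # Hypothetical greedy target model: the one "correct" next token per prefix.
    return (prefix[-1] + 1) % 50

def speculative_step(prefix, k=4):
    """Accept the longest drafted prefix the target agrees with, then append
    the target's own next token. Because every emitted token is checked (or
    produced) by the target, the output matches pure target decoding exactly;
    the drafter only changes speed, never the result."""
    draft = draft_block(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        if target_next(ctx) == tok:   # verifier agrees: keep the drafted token
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                     # first disagreement ends the accepted run
    # Always emit one token from the target itself (the correction/extension).
    accepted.append(target_next(ctx))
    return accepted

print(speculative_step([7]))  # [8, 9, 10, 11, 12]: all 4 drafts accepted, plus one
```

The speedup lever is the acceptance length: the more drafted tokens survive verification per step, the fewer sequential target passes are needed, which is why conditioning the drafter on target features matters.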
// TAGS
llm · inference · speculative decoding · diffusion · open source · sglang · hugging face

DISCOVERED

4d ago

2026-04-07

PUBLISHED

4d ago

2026-04-07

RELEVANCE

9/10

AUTHOR

Total-Resort-3120