OPEN_SOURCE
LOBSTERS // 40d ago · PRODUCT LAUNCH

Mercury diffusion coder models hit 1,109 tok/s

Inception Labs’ Mercury paper introduces diffusion-based coding LLMs (Mini and Small) that generate tokens in parallel rather than one at a time, reporting 1,109 tokens/sec for Mini and 737 tokens/sec for Small on NVIDIA H100 GPUs. The work claims up to 10x throughput over speed-optimized autoregressive models while staying competitive on coding-quality benchmarks and in Copilot Arena.

// ANALYSIS

This is a serious attempt to break the autoregressive latency ceiling for coding assistants, and the speed-quality tradeoff looks compelling if independent, real-world evaluations bear out the benchmark results.

  • The key technical bet is parallel denoising over discrete tokens, which attacks the serial decode bottleneck directly (see the sketch after this list).
  • Reported throughput is large enough to materially change UX for autocomplete, agent loops, and iterative coding chat: at 1,109 tok/s, a 300-token completion arrives in under 0.3 seconds.
  • Quality claims are strong but still benchmark-heavy, so production reliability across messy enterprise codebases is the next proof point.
  • If diffusion LLM serving matures, incumbent “fast” autoregressive coding models could face real pricing and latency pressure.
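Mercury's exact sampler is not public, so as a rough illustration of what "parallel denoising over discrete tokens" means, here is a minimal confidence-based unmasking loop in the style of masked discrete diffusion. `model` and `MASK_ID` are hypothetical stand-ins for a bidirectional denoiser and its mask-token id, not Mercury's actual interface.

import math
import torch

MASK_ID = 0  # hypothetical [MASK] token id; real vocabularies differ

@torch.no_grad()
def parallel_denoise(model, prompt_ids: torch.Tensor, gen_len: int, steps: int = 8):
    # Start with every generated position masked; each step commits the
    # most confident predictions in parallel instead of decoding one
    # token per forward pass (the autoregressive serial bottleneck).
    x = torch.cat([prompt_ids, torch.full((gen_len,), MASK_ID, dtype=torch.long)])
    masked = torch.zeros_like(x, dtype=torch.bool)
    masked[len(prompt_ids):] = True

    for step in range(steps):
        logits = model(x.unsqueeze(0))[0]        # assumed shape: (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence, argmax

        remaining = int(masked.sum())
        k = math.ceil(remaining / (steps - step))  # tokens to commit this step
        conf = conf.masked_fill(~masked, -1.0)     # rank only still-masked slots
        commit = conf.topk(k).indices

        x[commit] = pred[commit]                 # unmask k positions at once
        masked[commit] = False
        if not masked.any():
            break
    return x[len(prompt_ids):]

With a denoiser that returns per-position logits, the fixed number of forward passes here replaces hundreds of sequential decode steps, which is where the claimed throughput advantage would come from.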
// TAGS
mercury-coder · llm · ai-coding · inference · research

DISCOVERED
2026-03-03 (40d ago)

PUBLISHED
2026-02-25 (46d ago)

RELEVANCE
9/10