OPEN_SOURCE
LOBSTERS // 40d ago · PRODUCT LAUNCH
Mercury diffusion coder models hit 1,109 tok/s
Inception Labs’ Mercury paper introduces diffusion-based coding LLMs (Mini and Small) that generate tokens in parallel, reporting 1,109 and 737 tokens/sec respectively on NVIDIA H100 GPUs. The paper claims up to 10x throughput gains over speed-optimized autoregressive models while staying competitive on coding quality benchmarks and Copilot Arena.
// ANALYSIS
This is a serious attempt to break the autoregressive latency ceiling for coding assistants, and the speed-quality tradeoff looks compelling if independent, real-world evaluations hold up.
- The key technical bet is parallel denoising over discrete tokens, which attacks the serial decode bottleneck directly (see the sketch after this list).
- Reported throughput numbers are large enough to materially change UX for autocomplete, agent loops, and iterative coding chat.
- Quality claims are strong but still benchmark-heavy, so production reliability across messy enterprise codebases is the next proof point.
- If diffusion LLM serving matures, incumbent “fast” autoregressive coding models could face real pricing and latency pressure.
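To make the parallel-denoising bet concrete, here is a minimal sketch of generation by iterative unmasking, assuming a MaskGIT-style confidence schedule. Mercury’s actual sampler is not public; `denoiser`, `MASK_ID`, `SEQ_LEN`, and `NUM_STEPS` are illustrative assumptions, not Inception Labs’ API.

```python
import torch

MASK_ID = 0       # assumed special mask/"noise" token id
SEQ_LEN = 64      # tokens generated per block
NUM_STEPS = 8     # denoising steps; each step refines every position at once

def generate(denoiser, prompt_ids: torch.Tensor) -> torch.Tensor:
    # Start from a fully masked (maximally noised) sequence.
    x = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        # One forward pass scores every position in parallel -- unlike
        # autoregressive decode, which needs one pass per emitted token.
        logits = denoiser(prompt_ids, x)        # (1, SEQ_LEN, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        # Commit the most confident positions; leave the rest masked so
        # later steps can still revise them.
        k = SEQ_LEN * (step + 1) // NUM_STEPS   # unmasking budget grows
        keep = conf.topk(k, dim=-1).indices
        x = torch.full_like(x, MASK_ID)
        x.scatter_(1, keep, pred.gather(1, keep))
    return x
```

With these illustrative numbers, the loop costs 8 forward passes for 64 tokens where an autoregressive decoder would pay 64, which is the source of the claimed throughput gap; the open question is whether quality survives that parallelism on real codebases.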
// TAGS
mercury-coder · llm · ai-coding · inference · research
DISCOVERED
2026-03-03 (40d ago)
PUBLISHED
2026-02-25 (46d ago)
RELEVANCE
9/10