OPEN_SOURCE
YT · YOUTUBE // MODEL RELEASE
Mercury 2 hits 1,000 tok/sec
Inception Labs has launched Mercury 2, a diffusion-based reasoning LLM that generates text through parallel refinement rather than sequential next-token decoding. The pitch is simple but important for production teams: 1,009 tokens/sec on NVIDIA Blackwell GPUs, a 128K context window, native tool use, structured JSON output, and an OpenAI-compatible API aimed at latency-sensitive AI workloads.
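Because the API is advertised as OpenAI-compatible, adoption should mostly mean repointing an existing client. A minimal sketch using the official openai Python SDK follows; the base URL and model identifier are assumptions for illustration, not confirmed values from Inception Labs' documentation:

```python
# Minimal sketch of hitting an OpenAI-compatible endpoint with the official
# openai Python SDK. The base_url and model name are hypothetical; substitute
# the real values from Inception Labs' docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-2",  # hypothetical model identifier
    messages=[
        {
            "role": "user",
            "content": "Return a JSON object with keys 'summary' and 'severity' for this stack trace: ...",
        }
    ],
    # Mirrors OpenAI's JSON mode; the launch claims structured JSON output,
    # though the exact flag Mercury 2 supports is an assumption here.
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
```

If the compatibility claim holds, the only diff against an existing OpenAI integration is the constructor arguments, which is exactly the low-friction migration story the launch is selling.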
// ANALYSIS
Mercury 2 is one of the clearest shots yet at the autoregressive status quo: if the quality holds up in real workloads, speed stops being a UX tax and starts becoming a product advantage.
- The real story is not just headline throughput: diffusion-based generation changes the latency curve for agent loops, coding copilots, voice systems, and RAG pipelines (see the pass-count sketch after this list).
- Inception is positioning Mercury 2 as a drop-in API replacement, which lowers adoption friction for teams already built around OpenAI-style tooling.
- The model looks strongest for structured output, search, real-time interaction, and coding assistance, where sub-second responsiveness matters more than squeezing out every last point of frontier reasoning quality.
- This launch also puts pressure on mainstream model vendors to show better speed-quality tradeoffs, not just bigger benchmark numbers.
- Outside commentary already frames Mercury 2 as part of a likely hybrid future, where diffusion models handle fast draft generation and slower autoregressive models handle high-stakes refinement; a minimal version of that pattern is sketched below.
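The latency argument in the first bullet can be made concrete with a toy pass-count model. This is an illustration, not Mercury 2's actual decoding procedure: the fixed step count of 16 is an arbitrary assumption, and real diffusion LMs may scale refinement steps with quality targets.

```python
# Toy illustration (not Mercury 2's actual algorithm) of why parallel
# refinement changes the latency curve. An autoregressive decoder needs one
# forward pass per generated token; a diffusion-style decoder refines every
# position in parallel over a fixed number of denoising steps.

def autoregressive_passes(num_tokens: int) -> int:
    # One model forward pass per generated token.
    return num_tokens

def diffusion_passes(num_tokens: int, denoise_steps: int = 16) -> int:
    # All positions update together, so the pass count tracks the number of
    # refinement steps rather than the sequence length.
    return denoise_steps

for n in (64, 512, 4096):
    print(f"{n:5d} tokens: autoregressive={autoregressive_passes(n):5d} passes, "
          f"diffusion={diffusion_passes(n):3d} passes")
```

Under this simplification, autoregressive cost grows linearly with output length while the diffusion side stays flat, which is the structural reason speed could become a product advantage for long agent loops rather than just a benchmark number.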
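The hybrid pattern from the last bullet is speculation from outside commentary, but it is easy to sketch. Everything below is hypothetical: the model names are placeholders, and both clients are assumed to be constructed as in the OpenAI-compatible example above.

```python
# Hypothetical draft-then-refine pipeline: a fast diffusion model drafts,
# a slower autoregressive model refines. Model names are placeholders.

def hybrid_answer(fast_client, slow_client, prompt: str) -> str:
    # Fast, cheap draft from the diffusion model.
    draft = fast_client.chat.completions.create(
        model="mercury-2",  # hypothetical fast drafter
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Slower, higher-stakes refinement pass.
    refined = slow_client.chat.completions.create(
        model="frontier-autoregressive-model",  # placeholder name
        messages=[{
            "role": "user",
            "content": f"Improve this draft answer for correctness and clarity:\n\n{draft}",
        }],
    ).choices[0].message.content
    return refined
```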
// TAGS
mercury-2 · llm · reasoning · inference · api
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
RELEVANCE
10 / 10
AUTHOR
AI Revolution