Inception's Mercury 2, the first commercial-scale reasoning diffusion LLM, is now available for production deployment on Baseten.
Baseten has announced that Inception's Mercury 2 is now live on its platform, making Baseten the first inference platform to deliver production-grade reasoning diffusion LLMs (dLLMs) to developers. Unlike traditional autoregressive models, which generate tokens one at a time, Mercury 2 uses a diffusion architecture to generate and refine multiple tokens in parallel, enabling speeds of over 1,000 tokens per second on widely deployed NVIDIA GPUs. Partners such as Augment Code have already deployed Mercury 2 in production, achieving a 90% reduction in inference costs and an 82% drop in latency for critical workloads while maintaining quality comparable to speed-optimized models such as Claude 3 Haiku and GPT-5 mini.
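To put those figures in concrete terms, here is a minimal sketch of the arithmetic. The 500-token response length, the 2.0-second baseline latency, and the $100 baseline cost are hypothetical values chosen purely for illustration; only the 1,000 tokens per second, 90%, and 82% figures come from the announcement.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


def after_reduction(baseline: float, percent_reduction: float) -> float:
    """Value that remains after cutting baseline by percent_reduction percent."""
    return baseline * (1 - percent_reduction / 100)


# Hypothetical 500-token response at the quoted 1,000 tokens per second.
response_time = generation_seconds(500, 1000)

# Hypothetical baselines: 2.0 s latency and $100 of inference spend.
new_latency = after_reduction(2.0, 82)   # 82% latency drop
new_cost = after_reduction(100.0, 90)    # 90% cost reduction

print(f"500 tokens at 1,000 tok/s: {response_time:.2f} s")
print(f"latency after 82% drop:    {new_latency:.2f} s")
print(f"cost after 90% reduction:  ${new_cost:.2f}")
```

Because the announcement quotes "over 1,000 tokens per second", the computed response time is best read as an upper bound for that hypothetical response length.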
Diffusion architectures represent a shift away from the token-by-token sequential bottleneck of autoregressive LLMs, and show that raw speed doesn't require specialized custom silicon.
- **Parallel refinement**: By drafting the output and refining it over parallel passes, dLLMs bypass sequential generation constraints, making them faster at the architectural level rather than through decoding patches.
- **Massive cost and latency benefits**: Early production metrics from Augment Code (90% cost reduction, 82% latency drop) demonstrate that dLLMs can drastically improve the economics of high-throughput agent loops.
- **Ideal for targeted agentic tasks**: While Mercury 2 is not a replacement for high-intelligence frontier models, its speed makes it well suited to sub-second tasks like tool routing, code completion, and real-time voice agents.
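The parallel-refinement idea can be illustrated with a toy step count. This is a simplified sketch, not Mercury 2's actual algorithm: the "model" is a stand-in that copies tokens from a known target, `tokens_per_pass` is a hypothetical parameter, and the sketch only fills masked positions (a real diffusion model may also revise tokens it has already placed).

```python
MASK = None  # placeholder for a position that has not been filled yet


def autoregressive_decode(target):
    """Toy sequential decode: one step per token."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)  # stand-in for one forward pass yielding one token
        steps += 1
    return out, steps


def parallel_refine_decode(target, tokens_per_pass):
    """Toy diffusion-style decode: start fully masked, fill several positions per pass."""
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    draft = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in draft):
        masked = [i for i, tok in enumerate(draft) if tok is MASK]
        # Stand-in for "fill the most confident positions": take the first few.
        for i in masked[:tokens_per_pass]:
            draft[i] = target[i]
        passes += 1
    return draft, passes


tokens = "def add(a, b): return a + b".split()  # 7 whitespace-separated tokens
_, ar_steps = autoregressive_decode(tokens)
_, passes = parallel_refine_decode(tokens, tokens_per_pass=4)
print(f"sequential steps: {ar_steps}, parallel passes: {passes}")
# prints: sequential steps: 7, parallel passes: 2
```

The point of the comparison is the step count: sequential decoding scales with output length, while the number of parallel passes depends on how many positions each pass can settle.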
Discovered: 2026-06-15
Published: 2026-06-15
Author: phylera14