Mercury 2 enters production on Baseten
Mercury 2 is a proprietary diffusion large language model (LLM) developed by Inception Labs that generates tokens in parallel to achieve inference speeds exceeding 1,000 tokens per second on NVIDIA GPUs. Designed for high-speed reasoning and agentic workflows, the model is now available in production on Baseten, with early adopters like Augment Code reporting a 90% reduction in inference costs.
Diffusion-based text generation produces tokens in parallel rather than one at a time, sidestepping the sequential bottleneck of autoregressive decoding. If the reported speeds hold up in practice, that could make real-time agentic reasoning practical at scale.
* Generating 1,000+ tokens per second on NVIDIA GPUs lowers the latency floor for multi-agent reasoning loops that chain many model calls.
* The 90% reduction in inference costs reported by Augment Code would make high-frequency model calls economically viable for enterprise coding and reasoning applications.
* Deployment on Baseten simplifies production serving and scaling for developers integrating the model.
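
To make the latency point concrete, here is a rough back-of-the-envelope sketch of generation time for a sequential agent loop. Only the 1,000 tokens-per-second figure comes from the announcement; the step count, tokens per step, and the 100 tokens-per-second baseline are hypothetical values chosen for illustration, and the calculation ignores network overhead, prompt processing, and tool execution time.

```python
def loop_generation_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Time spent purely on token generation across a sequential agent loop."""
    return steps * tokens_per_step / tokens_per_second


STEPS = 10              # hypothetical: sequential model calls in one agent task
TOKENS_PER_STEP = 500   # hypothetical: output tokens generated per call
BASELINE_TPS = 100.0    # assumed baseline throughput, not taken from the article
MERCURY_TPS = 1000.0    # lower bound quoted in the article

baseline = loop_generation_seconds(STEPS, TOKENS_PER_STEP, BASELINE_TPS)
fast = loop_generation_seconds(STEPS, TOKENS_PER_STEP, MERCURY_TPS)

print(f"baseline: {baseline:.1f} s, at 1,000 tok/s: {fast:.1f} s")
```

Under these assumed numbers, generation time for the loop drops from 50 seconds to 5 seconds. Because each step waits on the one before it, the saving scales with the number of chained calls.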
Discovered: 2026-06-11
Published: 2026-06-11
Author: _inception_ai