DeepInfra hosts NVIDIA Nemotron-3 Ultra
DeepInfra has announced serverless hosting for NVIDIA's 550-billion-parameter Nemotron-3 Ultra Mixture-of-Experts (MoE) model. DeepInfra says the deployment delivers more than 300 tokens per second and targets complex reasoning and enterprise-grade agentic workflows.
DeepInfra's rapid adoption of Nemotron-3 Ultra underscores how intensely AI inference hosts compete to serve frontier-level open-weights models at scale. By hosting the 550B-parameter model, DeepInfra gives individual developers access to enterprise-grade reasoning that would otherwise be prohibitively expensive and complex to self-host.
* Serving a high-density MoE model at more than 300 tokens per second showcases DeepInfra's inference engineering.
* The model's focus on long-running, autonomous agents signals an industry shift from simple chat interfaces to complex, multi-step problem-solving systems.
* Hybrid architectures such as Mamba-Transformer are proving competitive with pure Transformer implementations at this scale while using resources more efficiently.
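To put the self-hosting and throughput claims in rough numbers, the sketch below does the back-of-the-envelope arithmetic. Only the 550B parameter count and the 300 tokens-per-second figure come from the announcement; the weight precisions (2 bytes and 1 byte per parameter), the 80 GB accelerator size, and the 2,000-token response length are illustrative assumptions, and the helper names are hypothetical. It counts weight storage only and ignores KV cache, activations, and runtime overhead, so real requirements would be higher.

```python
import math

# From the announcement.
PARAMS = 550e9            # total parameters
TOKENS_PER_SECOND = 300   # reported lower bound on generation speed

# Illustrative assumptions, not from the announcement.
GPU_MEMORY_GB = 80        # assumed memory per accelerator
RESPONSE_TOKENS = 2000    # assumed length of one long reasoning response


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus(params: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Smallest number of accelerators whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a steady generation rate."""
    return output_tokens / tokens_per_second


# An MoE model activates only some experts per token, but every expert
# still has to be resident in memory, so the total count is what matters here.
for label, nbytes in (("16-bit", 2), ("8-bit", 1)):
    gb = weight_memory_gb(PARAMS, nbytes)
    gpus = min_gpus(PARAMS, nbytes, GPU_MEMORY_GB)
    print(f"{label} weights: {gb:.0f} GB, at least {gpus} x {GPU_MEMORY_GB} GB accelerators")

secs = generation_seconds(RESPONSE_TOKENS, TOKENS_PER_SECOND)
print(f"{RESPONSE_TOKENS} tokens at {TOKENS_PER_SECOND} tok/s: about {secs:.1f} s")
```

Under these assumptions the weights alone occupy 1,100 GB at 16-bit precision, which is at least 14 accelerators of 80 GB each before any serving overhead. That is the scale of hardware a serverless endpoint spares an individual developer.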
DISCOVERED
2026-06-01
PUBLISHED
2026-06-01
AUTHOR
DeepInfra
