NVIDIA has launched Nemotron 3 Ultra, a 550-billion-parameter hybrid Mamba-Transformer model that runs up to five times faster, at 30% lower cost, than comparable open models.
NVIDIA has unveiled Nemotron 3 Ultra, the flagship of its new Nemotron 3 family of foundation models built for complex reasoning, planning, and multi-step agentic workflows. The model has 550 billion total parameters but activates only 55 billion per token, using a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture with LatentMoE token compression and NVFP4 training. This design enables throughput exceeding 300 tokens per second, a 1-million-token context window, and up to 5x faster inference at 30% lower cost than existing open-weights models in its class.
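The figures above imply some simple arithmetic. The following is a minimal sketch that uses only the numbers quoted in this summary (550 billion total parameters, 55 billion active per token, 300 tokens per second as a lower bound on throughput). The function and constant names are illustrative and do not come from NVIDIA.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the summary.
# All names here are illustrative; only the three numbers come from the text.

TOTAL_PARAMS = 550e9       # total parameters
ACTIVE_PARAMS = 55e9       # parameters activated per token
TOKENS_PER_SECOND = 300    # quoted throughput is "exceeding" this value


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def seconds_per_token(tokens_per_second: float) -> float:
    """Average time per generated token at a given throughput."""
    return 1.0 / tokens_per_second


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
latency_ms = seconds_per_token(TOKENS_PER_SECOND) * 1000

print(f"Active share per token: {fraction:.0%}")
print(f"Time per token at {TOKENS_PER_SECOND} tok/s: {latency_ms:.2f} ms (upper bound)")
```

Because the quoted throughput is a floor, the per-token time printed here is an upper bound. Actual latency depends on hardware, batch size, and sequence length, none of which are given in the summary.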
NVIDIA is proving that massive models do not have to be computationally prohibitive: by combining Mamba, Transformer, and LatentMoE components, it has set a new standard for compute efficiency in open-weight agentic models.
- **Architectural Triumph**: The hybrid Mamba-Transformer MoE structure demonstrates that state-space models (SSMs) are viable in extreme-scale production models.
- **Agentic Optimization**: With a 1-million-token context window and high token throughput, Nemotron 3 Ultra is tailored to multi-agent systems and deep research workloads.
- **Enterprise Open-Weights Push**: By releasing a customizable, efficient 550B model, NVIDIA is positioning its ecosystem as the default engine for custom enterprise AI agents.
Discovered: 2026-06-01
Published: 2026-06-01
Author: jeremyphoward