Xiaomi MiMo-V2.5-Pro-UltraSpeed tops 1,000 TPS

// 45d ago · OPENSOURCE RELEASE

Xiaomi's MiMo team and TileRT have released MiMo-V2.5-Pro-UltraSpeed, achieving decoding speeds over 1,000 tokens per second on a trillion-parameter MoE model using a single 8-GPU node. This ultra-fast serving mode is enabled by DFlash speculative decoding and MXFP4 quantization, and the model checkpoints are open-sourced on Hugging Face.
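To give a rough sense of why 4-bit weights matter for fitting a trillion-parameter model on one node, here is a back-of-envelope sketch. It assumes the standard OCP MX layout for MXFP4 (4-bit elements with one shared 8-bit scale per block of 32) and treats "trillion-parameter" as exactly 1e12 weights. The `EXPERT_FRACTION` value is purely hypothetical; the release summary above does not say how the weights split between experts and attention.

```python
# Back-of-envelope weight footprint. Model-shape numbers are illustrative assumptions,
# not figures from the MiMo release.

MXFP4_BLOCK = 32       # elements sharing one scale (OCP MX format)
MXFP4_ELEM_BITS = 4    # one E2M1 element
MXFP4_SCALE_BITS = 8   # one E8M0 scale per block


def bits_per_weight(fmt: str) -> float:
    """Average storage cost per weight, including MXFP4's shared block scale."""
    if fmt == "mxfp4":
        return MXFP4_ELEM_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK
    if fmt == "fp8":
        return 8.0
    if fmt == "bf16":
        return 16.0
    raise ValueError(f"unknown format: {fmt}")


def footprint_gb(n_params: float, fmt: str) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight(fmt) / 8 / 1e9


TOTAL_PARAMS = 1e12      # "trillion-parameter", taken literally
EXPERT_FRACTION = 0.95   # HYPOTHETICAL share of weights that live in MoE experts

mixed_gb = (
    footprint_gb(TOTAL_PARAMS * EXPERT_FRACTION, "mxfp4")
    + footprint_gb(TOTAL_PARAMS * (1 - EXPERT_FRACTION), "fp8")
)

if __name__ == "__main__":
    for fmt in ("bf16", "fp8", "mxfp4"):
        print(f"{fmt:>6}: {footprint_gb(TOTAL_PARAMS, fmt):8.1f} GB")
    print(f" mixed: {mixed_gb:8.1f} GB (experts MXFP4, remainder FP8)")
```

The same ratio applies to the bytes read per decoded token, which is why lower-precision expert weights ease the memory-bandwidth bottleneck as well as the capacity limit. KV cache and activations are left out of this estimate.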

// ANALYSIS

While Xiaomi's 1,000+ TPS on a trillion-parameter model is a stellar engineering feat, claims that this is the first useful speculative decoding method deployed on a quasi-frontier model are overstated.

* Speeds exceeding 1,000 TPS on commodity 8-GPU nodes demonstrate that massive models can be served efficiently without specialized supercomputers.

* DFlash's block-level masked parallel prediction offers a significant throughput improvement over traditional autoregressive draft-then-verify loops.

* Selective quantization (MXFP4 for experts, FP8 for attention) strikes a critical balance between reducing memory bandwidth bottlenecks and maintaining reasoning capabilities.

* The release of Hugging Face checkpoints invites open-source validation of these performance claims in real-world scenarios.

* Veteran AI practitioners note that multi-token prediction and speculative decoding have been used in production for nearly two years, making the "first useful" claim historically inaccurate.
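For readers unfamiliar with the "traditional draft-then-verify loop" mentioned above, the following toy sketch shows the greedy variant. It is not DFlash and not Xiaomi's or TileRT's code. The two "models" (`target_next`, `draft_next`) are hypothetical stand-in functions, chosen only so the draft is usually right and occasionally wrong.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the small draft model: wrong whenever len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def autoregressive(prompt: List[Token], n_new: int, target: NextToken) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative(prompt: List[Token], n_new: int, target: NextToken,
                draft: NextToken, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify. Returns the tokens and the number of verification steps."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_steps = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system scores all
        #    positions in one batched forward pass; this toy calls target() per position.
        verify_steps += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all k matched: one bonus token
        out.extend(accepted)
    return out[:goal], verify_steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive(prompt, 20, target_next)
    fast, steps = speculative(prompt, 20, target_next, draft_next, k=4)
    print("identical output:", fast == baseline)
    print("verification steps:", steps, "vs 20 sequential target steps")
```

Every emitted token is one the target would have chosen itself, so the output matches plain greedy decoding. The speedup comes from needing fewer sequential target steps. The drafting here is still sequential, which is the part that block-level parallel prediction schemes such as the one attributed to DFlash aim to remove.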

// TAGS

speculative-decoding, moe, llm-inference, quantization, open-source, xiaomi

DISCOVERED

45d ago

2026-06-08

PUBLISHED

45d ago

2026-06-08

RELEVANCE

8/10

AUTHOR

jeremyphoward
