Xiaomi MiMo-V2.5-Pro-UltraSpeed tops 1,000 TPS
Xiaomi's MiMo team and TileRT have released MiMo-V2.5-Pro-UltraSpeed, achieving decoding speeds over 1,000 tokens per second on a trillion-parameter MoE model using a single 8-GPU node. This ultra-fast serving mode is enabled by DFlash speculative decoding and MXFP4 quantization, and the model checkpoints are open-sourced on Hugging Face.
While Xiaomi's 1,000+ TPS on a trillion-parameter model is a stellar engineering feat, claims that this is the first useful speculative decoding method deployed on a quasi-frontier model are overstated.
* Speeds exceeding 1,000 TPS on a single 8-GPU node demonstrate that a trillion-parameter MoE model can be served efficiently without a multi-node cluster.
* DFlash's block-level masked parallel prediction offers a significant throughput improvement over traditional autoregressive draft-then-verify loops.
* Selective quantization (MXFP4 for experts, FP8 for attention) strikes a critical balance between reducing memory bandwidth bottlenecks and maintaining reasoning capabilities.
* The release of Hugging Face checkpoints invites open-source validation of these performance claims in real-world scenarios.
* Veteran AI practitioners note that multi-token prediction and speculative decoding have been used in production for nearly two years, which makes the "first useful" claim historically inaccurate.
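
For readers unfamiliar with the draft-then-verify idea mentioned above, here is a minimal toy sketch of generic greedy speculative decoding. It is not DFlash and not Xiaomi's or TileRT's code: `target_next`, `draft_block`, and the arithmetic "models" are hypothetical stand-ins invented for illustration. The drafter proposes a block of tokens, the target checks them, and the longest matching prefix is kept along with one token from the target, so the output is identical to plain greedy decoding while needing fewer target passes.

```python
from typing import Callable, List, Tuple

Token = int


def target_next(seq: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy next-token rule."""
    return (3 * seq[-1] + 1) % 7


def draft_block(seq: List[Token], k: int) -> List[Token]:
    """Hypothetical cheap drafter: proposes k tokens, wrong whenever it sees a 5."""
    out: List[Token] = []
    ctx = list(seq)
    for _ in range(k):
        guess = 0 if ctx[-1] == 5 else (3 * ctx[-1] + 1) % 7
        out.append(guess)
        ctx.append(guess)
    return out


def greedy_decode(next_fn: Callable[[List[Token]], Token],
                  prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_fn(seq))
    return seq


def speculative_decode(prompt: List[Token], n_new: int,
                       block: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        proposal = draft_block(seq, block)
        # A real system scores every proposed position in ONE batched target
        # forward pass; the inner loop below only simulates that pass.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # first mismatch: keep target's token
                break
        else:
            # Whole block accepted: the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], passes
```

Calling `speculative_decode([1], 12)` and `greedy_decode(target_next, [1], 12)` should return the same token sequence; the speedup in a real deployment depends on how often the drafter's block is accepted and on how cheap the drafter is relative to the target.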
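
To make the MXFP4 point concrete, the sketch below quantizes one block in the style of the OCP Microscaling format: 32 values share a single power-of-two scale, and each value is stored as a 4-bit E2M1 number. The summary does not describe the release's actual quantizer, so the scale rule and rounding used here are assumptions (one common choice), and the function names are hypothetical.

```python
import math
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
EMAX_E2M1 = 2    # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2
BLOCK = 32       # elements sharing one 8-bit power-of-two scale


def quantize_block(xs: List[float]) -> Tuple[int, List[float]]:
    """Return (shared exponent, per-element E2M1 values in scaled units)."""
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 0, [0.0] * len(xs)
    # floor(log2(amax)) computed exactly via frexp, then shifted so the
    # block maximum lands near the top of the E2M1 range (assumed rule).
    shared_exp = (math.frexp(amax)[1] - 1) - EMAX_E2M1
    scale = 2.0 ** shared_exp
    codes = []
    for x in xs:
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6
        # Nearest representable magnitude; ties go to the smaller value here,
        # a simplification of round-to-nearest-even.
        q = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(math.copysign(q, x))
    return shared_exp, codes


def dequantize_block(shared_exp: int, codes: List[float]) -> List[float]:
    """Reconstruct approximate values from the shared scale and the codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


# Storage cost: 4 bits per element plus one 8-bit scale per 32 elements.
BITS_PER_WEIGHT = 4 + 8 / BLOCK
```

Under these assumptions each weight costs 4.25 bits on average, compared with 8 for FP8 or 16 for BF16, which is where the memory-bandwidth saving comes from. The coarse eight-level grid is also why a release might keep more sensitive layers, such as attention, at higher precision.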
DISCOVERED
2026-06-08
PUBLISHED
2026-06-08
AUTHOR
jeremyphoward