Xiaomi MiMo and TileRT have released MiMo-V2.5-Pro-UltraSpeed, breaking the 1000 tokens per second decoding threshold for a 1-trillion-parameter model on commodity GPUs.

// 45d ago MODEL RELEASE

Xiaomi's MiMo team, in collaboration with TileRT, has announced MiMo-V2.5-Pro-UltraSpeed, a new execution mode that achieves generation speeds exceeding 1,000 tokens per second on a 1-trillion-parameter Mixture of Experts (MoE) model. Instead of relying on specialized silicon (like Groq or Cerebras), this performance is achieved on standard 8-GPU commodity nodes through deep hardware-software codesign. The key technical innovations include selective FP4 quantization for MoE experts to reduce memory bandwidth bottlenecks, DFlash speculative decoding (a block-level masked parallel prediction method), and TileRT's persistent kernel engine with warp specialization. An API is available for a limited-time trial, and the model's quantized weights and speculative decoding parameters have been open-sourced on Hugging Face.
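To make "selective FP4 quantization for MoE experts" concrete, here is a minimal sketch of the general idea: only tensors identified as expert weights are rounded to a 4-bit float grid with a per-block scale, and everything else is left untouched. The source does not describe the actual FP4 layout, block size, or tensor naming, so the E2M1 value grid, `block_size=32`, and the `is_expert` predicate below are all assumptions for illustration, not the MiMo or TileRT implementation.

```python
# Toy "fake quantization": values are rounded to an FP4 grid and immediately
# dequantized so the rounding error is easy to inspect.
# ASSUMPTION: E2M1 is used here because it is a common FP4 layout; the
# source does not say which format the release uses.
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Round one block of floats to signed E2M1 values with a shared scale.

    Returns (scale, quantized); the original is approximated by
    scale * quantized[i].
    """
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return 1.0, [0.0] * len(values)
    scale = peak / FP4_E2M1_MAGNITUDES[-1]  # largest magnitude maps to 6.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(-nearest if v < 0 else nearest)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Map FP4 values back to floats using the block's scale."""
    return [scale * q for q in quantized]


def quantize_selectively(weights, is_expert, block_size=32):
    """Quantize only tensors for which is_expert(name) is true.

    `weights` maps a tensor name to a flat list of floats. Both the naming
    scheme and the block size are hypothetical.
    """
    result = {}
    for name, values in weights.items():
        if not is_expert(name):
            result[name] = list(values)  # kept at original precision
            continue
        restored = []
        for start in range(0, len(values), block_size):
            scale, q = quantize_block(values[start:start + block_size])
            restored.extend(dequantize_block(scale, q))
        result[name] = restored
    return result
```

For example, calling `quantize_selectively(weights, lambda name: name.startswith("experts."))` would round only tensors whose (hypothetical) names start with `experts.`, which is where most of an MoE model's parameters usually sit.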

// ANALYSIS

Reaching 1,000+ tokens per second (TPS) on a 1T-parameter model with commodity GPUs indicates that aggressive hardware-software codesign can rival the decoding speed of specialized custom silicon (e.g., Groq or Cerebras) at a fraction of the infrastructure complexity.
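A rough way to see why weight precision matters for decode speed is a bandwidth-bound estimate: if each decode step has to stream every active weight from GPU memory once, the step rate cannot exceed memory bandwidth divided by bytes read per step. The sketch below is a simplified model under that assumption; it ignores KV-cache reads, inter-GPU communication, compute time, and quantization scale metadata, and none of its inputs come from the source.

```python
def decode_steps_per_second(expert_params, other_params,
                            expert_bytes_per_param, other_bytes_per_param,
                            bandwidth_bytes_per_s):
    """Upper bound on decode steps/s when each step streams all active weights once.

    Real throughput will be lower: KV-cache traffic, communication, and
    compute are deliberately left out of this estimate.
    """
    bytes_per_step = (expert_params * expert_bytes_per_param
                      + other_params * other_bytes_per_param)
    return bandwidth_bytes_per_s / bytes_per_step


def speedup_from_expert_fp4(expert_params, other_params,
                            baseline_bytes_per_param=1.0):
    """Ratio of the bound with FP4 experts (0.5 byte/param) to an
    all-baseline-precision layout. Bandwidth cancels out of the ratio."""
    before = decode_steps_per_second(
        expert_params, other_params,
        baseline_bytes_per_param, baseline_bytes_per_param, 1.0)
    after = decode_steps_per_second(
        expert_params, other_params,
        0.5, baseline_bytes_per_param, 1.0)
    return after / before
```

As an illustration only: if experts made up three quarters of the active parameters (a made-up split, not a MiMo figure), `speedup_from_expert_fp4(3, 1)` gives 1.6, meaning that halving expert bytes from 1 byte to 0.5 byte per parameter would raise the bandwidth bound by 1.6x.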

* **Co-design Over Custom Hardware:** Combining selective quantization (FP4 for MoE experts only) with TileRT's persistent kernel engine shows that optimized compilation and data flow can ease the memory-bandwidth bottleneck of commodity GPUs.

* **DFlash Parallel Drafting:** Block-level masked parallel prediction removes the serial drafting bottleneck of traditional speculative decoding and reaches an average acceptance length of 6.30 tokens in coding scenarios.

* **Paradigm Shift in Usability:** High-speed decoding elevates 1T models from slow batch responders to interactive agents capable of real-time search, multi-path reasoning (Best-of-N), and millisecond-level decision loops.
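The acceptance-length figure in the DFlash bullet can be tied to the headline speed with a small sketch. It shows generic greedy verification for speculative decoding, where a drafted block is checked against the target model's own choices and the matching prefix is kept, plus the arithmetic linking tokens per step to the required step rate. This is not DFlash itself: the function names are hypothetical, and the source does not say whether its 6.30 figure includes the extra token the target model contributes each step.

```python
def verify_block(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k + 1 greedy choices from the target model, computed in
                   one pass over the drafted block. Entries after the first
                   mismatch are conditioned on a rejected token and are
                   never used.
    Returns the tokens emitted this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted.append(drafted)
    # The target always contributes one token: the correction at the first
    # mismatch, or a bonus token after a fully accepted block.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def required_step_rate(tokens_per_second, mean_tokens_per_step):
    """Verification steps per second needed to sustain a token rate."""
    return tokens_per_second / mean_tokens_per_step
```

If 6.30 is read as the mean number of tokens emitted per verification step, `required_step_rate(1000, 6.30)` comes to roughly 159 steps per second, or about 6.3 ms per step. The 6.30 figure is reported for coding scenarios, so other workloads may need a different step rate.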

// TAGS

mimo, tilert, speculative-decoding, fp4-quantization, llm, inference, open-source

DISCOVERED

45d ago

2026-06-08

PUBLISHED

45d ago

2026-06-08

RELEVANCE

9/10

AUTHOR

gainsurier

// KEEP READING

More AI developer news from the feed

LAUNCH 28m ago

Tavily releases search plugin for Grok Build

Tavily has launched a new plugin specifically designed for Grok Build, enabling developers to integrate search capabilities engineered for Large Language Models directly into their building workflows. By utilizing Tavily's search API within Grok Build, users can retrieve real-time, structured web data to improve accuracy, reduce hallucinations, and ground AI generation in up-to-date external information.

MODEL 30m ago

Ling-3.0-flash hits OpenRouter via Novita AI

Novita AI has brought Ling-3.0-flash to OpenRouter, offering developer access to the 124B-parameter Mixture-of-Experts (MoE) model with ~5.1B active parameters per token for token-efficient agentic inference. To celebrate the rollout, the model is available for free on OpenRouter through August 3.

UPDATE 34m ago

B.AI adds Kimi K3 to Web Chat

B.AI has expanded access to Moonshot AI's latest open-source model, Kimi K3, by introducing it to its Web Chat interface. This update follows the model's initial release on the B.AI API, enabling users to interact directly with Kimi K3 through a conversational web interface without needing API integration.