
Xiaomi MiMo and TileRT have released MiMo-V2.5-Pro-UltraSpeed, breaking the 1000 tokens per second decoding threshold for a 1-trillion-parameter model on commodity GPUs.

// 1h ago · MODEL RELEASE


Xiaomi's MiMo team, in collaboration with TileRT, has announced MiMo-V2.5-Pro-UltraSpeed, a new execution mode that achieves generation speeds exceeding 1,000 tokens per second on a 1-trillion-parameter Mixture of Experts (MoE) model. Instead of relying on specialized silicon (like Groq or Cerebras), this performance is achieved on standard 8-GPU commodity nodes through deep hardware-software codesign. The key technical innovations include selective FP4 quantization for MoE experts to reduce memory bandwidth bottlenecks, DFlash speculative decoding (a block-level masked parallel prediction method), and TileRT's persistent kernel engine with warp-specialization. An API is available for a limited-time trial, and the model's quantized weights and speculative decoding parameters have been open-sourced on Hugging Face.

Announcement: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps (via Hacker News: https://news.ycombinator.com/item?id=48446639)
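The bandwidth argument behind expert-only FP4 can be made concrete with a rough calculation. In memory-bound decoding, each generated token has to stream the active weights out of GPU memory, so shrinking the bytes per weight shrinks that traffic roughly in proportion. The sketch below is a back-of-envelope model only: the active-parameter count and the expert share are hypothetical placeholders, not figures from the announcement, and quantization scale metadata is ignored.

```python
# Storage cost per parameter for common weight formats.
# Per-block scale factors used by real FP4 schemes are ignored here.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_bytes_per_token(active_params, expert_fraction, expert_fmt, other_fmt="bf16"):
    """Approximate weight bytes streamed from memory per decoded token.

    active_params   -- parameters touched per token (hypothetical input)
    expert_fraction -- share of those parameters that live in MoE experts
    expert_fmt      -- storage format of the expert weights
    other_fmt       -- storage format of everything else (attention, router, ...)
    """
    expert = active_params * expert_fraction * BYTES_PER_PARAM[expert_fmt]
    other = active_params * (1.0 - expert_fraction) * BYTES_PER_PARAM[other_fmt]
    return expert + other


# Hypothetical values chosen only to illustrate the arithmetic.
ACTIVE_PARAMS = 40e9
EXPERT_FRACTION = 0.75

baseline = weight_bytes_per_token(ACTIVE_PARAMS, EXPERT_FRACTION, "bf16")
selective = weight_bytes_per_token(ACTIVE_PARAMS, EXPERT_FRACTION, "fp4")
reduction = baseline / selective

print(f"all-BF16 weights:   {baseline / 1e9:.1f} GB per token")
print(f"expert-only FP4:    {selective / 1e9:.1f} GB per token")
print(f"traffic reduction:  {reduction:.2f}x")
```

Under these assumed inputs the non-expert weights dominate the remaining traffic, which is one reason a selective scheme tends to yield less than the 4x that quantizing every weight from BF16 to FP4 would give.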

// ANALYSIS

Reaching 1,000+ tokens per second (TPS) on a 1T-parameter model with commodity GPUs indicates that aggressive hardware-software codesign can match the decoding speed of specialized custom silicon (e.g., Groq or Cerebras) at a fraction of the infrastructure complexity.

* **Co-design Over Custom Hardware:** Combining selective quantization (FP4 for MoE experts only) with TileRT's persistent kernel engine shows that optimized compilation and data flow can relieve the memory bandwidth bottlenecks of commodity GPUs.

* **DFlash Parallel Drafting:** Block-level masked parallel prediction removes the serial drafting bottleneck of traditional speculative decoding and achieves an average acceptance length of 6.30 tokens in coding scenarios.

* **Paradigm Shift in Usability:** High-speed decoding elevates 1T models from slow batch responders to interactive agents capable of real-time search, multi-path reasoning (Best-of-N), and millisecond-level decision loops.
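The acceptance-length figure in the DFlash bullet translates directly into a per-step latency budget. The sketch below shows generic greedy speculative decoding with toy stand-in models, not DFlash itself: the draft here proposes tokens serially, whereas DFlash is described as predicting a block in parallel. All function names and both toy models are hypothetical. The budget calculation assumes that "acceptance length" counts every token emitted per target-model verification pass and ignores drafting cost.

```python
from typing import List, Tuple

VOCAB = 7  # toy vocabulary size


def target_model(ctx: List[int]) -> int:
    """Toy 'large' model: greedy next token as a deterministic function of context."""
    return (ctx[-1] * 3 + len(ctx)) % VOCAB


def draft_model(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target unless the target would emit 0."""
    guess = target_model(ctx)
    return 1 if guess == 0 else guess


def speculative_step(ctx: List[int], block_size: int) -> List[int]:
    """One draft-then-verify step; returns the tokens emitted by this step."""
    # Draft a block of candidate tokens.
    proposal: List[int] = []
    for _ in range(block_size):
        proposal.append(draft_model(ctx + proposal))

    # Verify against the target. A real system scores the whole block in one
    # batched forward pass; the loop here only mimics the accept/reject rule.
    emitted: List[int] = []
    for tok in proposal:
        expected = target_model(ctx + emitted)
        if tok != expected:
            emitted.append(expected)  # target's correction replaces the bad draft
            return emitted
        emitted.append(tok)
    emitted.append(target_model(ctx + emitted))  # bonus token when all drafts pass
    return emitted


def generate(prompt: List[int], n_tokens: int, block_size: int) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many verification steps were needed."""
    ctx, steps = list(prompt), 0
    while len(ctx) - len(prompt) < n_tokens:
        ctx.extend(speculative_step(ctx, block_size))
        steps += 1
    return ctx[len(prompt):len(prompt) + n_tokens], steps


def step_budget_ms(target_tps: float, tokens_per_step: float) -> float:
    """Time available per verification step to sustain target_tps."""
    return 1000.0 * tokens_per_step / target_tps


tokens, steps = generate([1], n_tokens=50, block_size=4)
print(f"{len(tokens)} tokens in {steps} verification steps")
print(f"budget per step at 1000 tok/s: {step_budget_ms(1000, 6.30):.2f} ms")
```

Because every emitted token is either a draft token the target agrees with or the target's own choice, the output matches plain greedy decoding with the target model; the speedup comes only from needing fewer target passes. With the reported 6.30 tokens per step, sustaining 1,000 tokens per second leaves roughly 6.3 ms for each verification pass.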

// TAGS
mimo, tilert, speculative-decoding, fp4-quantization, llm, inference, open-source

DISCOVERED

1h ago

2026-06-08

PUBLISHED

2h ago

2026-06-08

RELEVANCE

9/10

AUTHOR

gainsurier