Xiaomi MiMo and TileRT have released MiMo-V2.5-Pro-UltraSpeed, breaking the 1000 tokens per second decoding threshold for a 1-trillion-parameter model on commodity GPUs.
Xiaomi's MiMo team, in collaboration with TileRT, has announced MiMo-V2.5-Pro-UltraSpeed, a new execution mode that achieves generation speeds exceeding 1,000 tokens per second on a 1-trillion-parameter Mixture of Experts (MoE) model. Instead of relying on specialized silicon (like Groq or Cerebras), this performance is achieved on standard 8-GPU commodity nodes through deep hardware-software co-design. The key technical innovations include selective FP4 quantization for MoE experts to reduce memory bandwidth bottlenecks, DFlash speculative decoding (a block-level masked parallel prediction method), and TileRT's persistent kernel engine with warp specialization.
An API is available for a limited-time trial, and the model's quantized weights and speculative decoding parameters have been open-sourced on Hugging Face.
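The bandwidth argument behind expert-only FP4 can be made concrete with rough arithmetic. The sketch below is illustrative only: the 950B/50B split between expert and non-expert parameters and the 16-bit baseline are placeholder assumptions, not figures from the announcement, and the calculation ignores the per-block scale factors that FP4 formats usually carry.

```python
def weight_bytes(expert_params: int, other_params: int,
                 expert_bits: int = 4, other_bits: int = 16) -> float:
    """Weight storage in bytes when experts and the remaining layers use different precisions."""
    return (expert_params * expert_bits + other_params * other_bits) / 8


# Hypothetical split of a 1-trillion-parameter MoE (NOT from the announcement).
EXPERT_PARAMS = 950_000_000_000
OTHER_PARAMS = 50_000_000_000

# Baseline: every weight stored in 16 bits.
baseline = weight_bytes(EXPERT_PARAMS, OTHER_PARAMS, expert_bits=16)
# Selective quantization: only the expert weights drop to 4 bits.
selective = weight_bytes(EXPERT_PARAMS, OTHER_PARAMS, expert_bits=4)

print(f"16-bit everywhere: {baseline / 1e9:.0f} GB")
print(f"FP4 experts only:  {selective / 1e9:.0f} GB")
print(f"reduction:         {baseline / selective:.2f}x")
```

Under these placeholder numbers the stored weights shrink from 2000 GB to 575 GB. During decoding only the experts routed to for each token are read, so the per-token bandwidth saving depends on the active parameter count, which this sketch does not model.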
Achieving 1000+ TPS on a 1T-parameter model using commodity GPUs suggests that aggressive software-hardware co-design can match the decoding speed of specialized custom silicon (e.g., Groq or Cerebras) at a fraction of the infrastructure complexity.
* **Co-design Over Custom Hardware:** The combination of selective quantization (expert-only FP4) and TileRT's persistent kernel engine indicates that optimized compilation and data flow can substantially ease the memory bandwidth bottlenecks of commodity GPUs.
* **DFlash Parallel Drafting:** Block-level masked parallel prediction drafts a whole block of tokens at once, removing the serial drafting bottleneck of traditional speculative decoding. The reported average acceptance length is 6.30 tokens in coding scenarios.
* **Paradigm Shift in Usability:** High-speed decoding elevates 1T models from slow batch responders to interactive agents capable of real-time search, multi-path reasoning (Best-of-N), and millisecond-level decision loops.
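To make the acceptance-length figure concrete, here is a minimal sketch of generic greedy block verification as used in standard speculative decoding. It is not DFlash's actual algorithm, and `accept_block` and `max_step_ms` are hypothetical helper names. The sketch assumes "acceptance length" means tokens emitted per verification pass of the large model, and that the 6.30 figure applies to the same workload as the 1,000 tokens-per-second figure. Neither assumption is confirmed by the announcement.

```python
from typing import List, Sequence


def accept_block(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of one drafted block.

    draft:  k token ids proposed by the drafter.
    target: k + 1 token ids the large model predicts at the same positions
            (the extra one is the "bonus" token after a fully accepted block).
    Returns the tokens actually emitted for this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    emitted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:
            # First mismatch: keep the large model's token and stop.
            emitted.append(t)
            return emitted
        emitted.append(d)
    # Every drafted token matched, so the bonus token is emitted too.
    emitted.append(target[-1])
    return emitted


def max_step_ms(target_tps: float, tokens_per_step: float) -> float:
    """Longest verification step (in ms) that still sustains target_tps."""
    return 1000.0 * tokens_per_step / target_tps


print(accept_block([1, 2, 3, 4], [1, 2, 9, 4, 5]))  # mismatch at the third token
print(accept_block([1, 2], [1, 2, 3]))              # full accept plus bonus token
print(f"{max_step_ms(1000, 6.30):.1f} ms per step")
```

Under those assumptions, sustaining 1,000 tokens per second at 6.30 tokens per step leaves roughly 6.3 ms for each draft-and-verify step, which is why drafting a block in parallel rather than token by token matters.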
DISCOVERED
1h ago
2026-06-08
PUBLISHED
2h ago
2026-06-08
RELEVANCE
AUTHOR
gainsurier