MiMo-V2.5 shines in 1M-context stress test
This Reddit post is a hands-on benchmark of Xiaomi's open-source MiMo-V2.5 model running as an IQ3_S GGUF quantization in `llama-server` with a 1,048,576-token context window. The tester reports that, on an RTX 6000 96GB plus a W7800, MiMo-V2.5 feels faster and more stable than MiniMax at long contexts (around 50k tokens of actual use), especially in prefill speed and interactive use with VS Code plus kilocode. The main issue noted is occasional looping at low temperature, which the tester found could usually be mitigated with repetition-penalty tweaks, restarts, or possibly a fixed seed.
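For context, a 1M-token run of this kind would typically be launched with flags along these lines. This is a sketch, not the tester's actual invocation: the GGUF filename and port are assumptions, while `-c` (context size) and `-ngl` (GPU layer offload) are standard llama.cpp options.

```shell
# Hypothetical launch command; the model filename and port are illustrative.
# -c sets the context window to 1,048,576 tokens; -ngl 99 offloads all layers
# to the GPU (the post's setup splits this across an RTX 6000 and a W7800).
llama-server -m MiMo-V2.5-IQ3_S.gguf -c 1048576 -ngl 99 --port 8080
```

Note that a context window this large dominates VRAM use, which is why an aggressive quantization like IQ3_S matters for the fit at all.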
Hot take: this reads like a genuine long-context win for MiMo-V2.5 in local inference, but not a clean victory yet because stability still depends on sampling settings.
- The signal here is speed under extreme context, not just raw model quality: the poster says MiMo stays responsive where MiniMax drops off quickly.
- The setup matters: this is a quantized GGUF build on Vulkan GPUs, so the result is more about deployability and throughput than the full-precision flagship model.
- The loopiness is the main caveat. If that behavior is reproducible, it can erase the practical gains for code generation even when token throughput looks good.
- This is still a useful data point for people trying to run 1M-context models locally, because it suggests MiMo-V2.5 may be easier to keep usable than some alternatives in the same regime.
- The post is best read as an early benchmark result, not a final verdict, since the tester explicitly says they still need to push past 300k context before drawing broader conclusions.
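The looping mitigations the poster mentions map onto llama.cpp's standard per-request sampling parameters on the `/completion` endpoint. A minimal sketch of such a request follows; the parameter names (`temperature`, `repeat_penalty`, `repeat_last_n`, `seed`) are real llama.cpp sampling fields, but the specific values here are illustrative assumptions, not the tester's configuration.

```python
import json

def make_completion_request(prompt: str, seed: int = 42) -> str:
    """Build a JSON body for llama-server's /completion endpoint with
    anti-looping sampling settings. Values are illustrative guesses."""
    payload = {
        "prompt": prompt,
        "n_predict": 512,
        "temperature": 0.7,     # nudging temperature up can break low-temp loops
        "repeat_penalty": 1.1,  # mild penalty against verbatim repetition
        "repeat_last_n": 256,   # how many recent tokens the penalty considers
        "seed": seed,           # a fixed seed makes a loop reproducible to debug
    }
    return json.dumps(payload)

# POST this body to http://localhost:8080/completion on a running llama-server.
print(make_completion_request("Explain the repository layout."))
```

Fixing the seed does not prevent looping by itself, but it makes a bad generation repeatable, which is what lets you tell a sampling problem apart from nondeterministic flakiness.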
DISCOVERED
2026-05-09
PUBLISHED
2026-05-09
AUTHOR
LegacyRemaster