OPEN_SOURCE
REDDIT // 3h ago · INFRASTRUCTURE
MiMo-V2.5 SGLang checkpoint requires TP=4
This Reddit thread points out a real deployment wrinkle in SGLang’s MiMo-V2.5 cookbook: the checkpoint uses a TP=4-interleaved fused `qkv_proj`, so attention TP must stay at 4 within each DP group, and a plain `--tp 8` load will fail unless it is paired with the matching DP setting. In practice, the current SGLang path expects GPU counts divisible by 4. That is a runtime/layout constraint, not evidence that the model is “too big” for smaller hardware in every possible serving stack.
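A minimal sketch of the arithmetic behind the warning, assuming the thread’s description is accurate: with data-parallel attention, each DP group shards attention across `tp // dp` GPUs, and this checkpoint pins that quotient at 4. The function name and error text below are illustrative, not SGLang’s own code.

```python
# Hypothetical topology check (not SGLang internals): the TP=4-interleaved
# fused qkv_proj means attention TP within each DP group must equal 4.
REQUIRED_ATTN_TP = 4

def check_topology(tp: int, dp: int) -> None:
    """Reject (tp, dp) pairs whose per-group attention TP is not 4."""
    attn_tp, rem = divmod(tp, dp)
    if rem or attn_tp != REQUIRED_ATTN_TP:
        raise ValueError(
            f"tp={tp}, dp={dp} yields attention TP {tp / dp:g}; "
            f"this checkpoint requires exactly {REQUIRED_ATTN_TP}"
        )

check_topology(tp=4, dp=1)  # ok: a single TP=4 group
check_topology(tp=8, dp=2)  # ok: two DP groups of TP=4 each
check_topology(tp=8, dp=1)  # raises: a plain --tp 8 load
```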
// ANALYSIS
Hot take: this is a framework compatibility constraint, not a universal model law.
- The cookbook’s warning is credible: SGLang is enforcing the checkpoint’s shard layout, so `--tp` alone is not enough for MiMo-V2.5 (see the topology sketch above).
- The “multiple of 4” rule comes from attention-parallel grouping, not from raw VRAM.
- A 2-GPU setup can still be the wrong shape even if memory would fit after quantization, because the checkpoint expects a specific TP/DP topology; the sketch after this list shows why.
- The downside is real for small clusters: users with enough aggregate memory may still be blocked by the serving stack’s parallelism requirements.
- The upside is that the constraint is not necessarily permanent; another runtime or a future checkpoint format could relax it.
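To make the “wrong shape” point concrete, here is a small NumPy illustration. The sizes are invented and the exact layout is inferred from the thread’s description of a TP=4-interleaved fused `qkv_proj`: each persisted shard fuses one rank’s Q, K, and V slices, so contiguously re-splitting the shards at a different TP scrambles the blocks.

```python
import numpy as np

# Invented dimensions, purely for illustration.
hidden, q_out, kv_out, tp = 8, 16, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, hidden)) for n in (q_out, kv_out, kv_out))

# What the checkpoint persists: one fused block [Q_r | K_r | V_r] per
# attention-TP rank, with Q/K/V each pre-split 4 ways.
fused_shards = [
    np.concatenate([np.array_split(W, tp)[r] for W in (Q, K, V)])
    for r in range(tp)
]

# Loading at a different TP by contiguously regrouping the fused shards
# does NOT reproduce a valid shard: the Q/K/V blocks end up interleaved
# in the wrong order, so the weights would be silently misassigned.
naive_tp2_rank0 = np.concatenate(fused_shards[:2])
target_tp2_rank0 = np.concatenate([np.array_split(W, 2)[0] for W in (Q, K, V)])
print(np.array_equal(naive_tp2_rank0, target_tp2_rank0))  # False
```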
// TAGS
mimo-v2.5 · xiaomi · sglang · inference · tensor-parallelism · gpu · quantization · deployment
DISCOVERED
3h ago
2026-05-01
PUBLISHED
5h ago
2026-05-01
RELEVANCE
7/10
AUTHOR
Pyrenaeda