OPEN_SOURCE
REDDIT // 3h ago · INFRASTRUCTURE
MiMo-V2.5 SGLang checkpoint requires TP=4
This Reddit thread points out a real deployment wrinkle in SGLang’s MiMo-V2.5 cookbook: the checkpoint uses a TP=4-interleaved fused `qkv_proj`, so attention TP must stay at 4 within each DP group, and a plain `--tp 8` load will fail unless it is paired with the matching DP setting. In practice, the current SGLang path expects GPU counts divisible by 4. That is a runtime/layout constraint, not evidence that the model is “too big” for smaller hardware in every possible serving stack.
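A minimal sketch of the arithmetic behind the warning, assuming the thread’s description is accurate: with data-parallel attention, each DP group shards attention across `tp // dp` GPUs, and this checkpoint pins that quotient at 4. The function name and error text below are illustrative, not SGLang’s own code.

```python
# Hypothetical topology check (not SGLang internals): the TP=4-interleaved
# fused qkv_proj means attention TP within each DP group must equal 4.
REQUIRED_ATTN_TP = 4

def check_topology(tp: int, dp: int) -> None:
    """Reject (tp, dp) pairs whose per-group attention TP is not 4."""
    attn_tp, rem = divmod(tp, dp)
    if rem or attn_tp != REQUIRED_ATTN_TP:
        raise ValueError(
            f"tp={tp}, dp={dp} yields attention TP {tp / dp:g}; "
            f"this checkpoint requires exactly {REQUIRED_ATTN_TP}"
        )

check_topology(tp=4, dp=1)  # ok: a single TP=4 group
check_topology(tp=8, dp=2)  # ok: two DP groups of TP=4 each
check_topology(tp=8, dp=1)  # raises: a plain --tp 8 load
```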
// ANALYSIS
Hot take: this is a framework compatibility constraint, not a universal model law.
- The cookbook’s warning is credible: SGLang is enforcing the checkpoint’s shard layout, so `--tp` alone is not enough for MiMo-V2.5 (see the topology sketch above).
- The “multiple of 4” rule comes from attention-parallel grouping, not from raw VRAM.
- A 2-GPU setup can still be the wrong shape even if memory would fit after quantization, because the checkpoint expects a specific TP/DP topology; the sketch after this list shows why.
- The downside is real for small clusters: users with enough aggregate memory may still be blocked by the serving stack’s parallelism requirements.
- The upside is that the constraint is not necessarily permanent; another runtime or a future checkpoint format could relax it.
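To make the “wrong shape” point concrete, here is a small NumPy illustration. The sizes are invented and the exact layout is inferred from the thread’s description of a TP=4-interleaved fused `qkv_proj`: each persisted shard fuses one rank’s Q, K, and V slices, so contiguously re-splitting the shards at a different TP scrambles the blocks.

```python
import numpy as np

# Invented dimensions, purely for illustration.
hidden, q_out, kv_out, tp = 8, 16, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, hidden)) for n in (q_out, kv_out, kv_out))

# What the checkpoint persists: one fused block [Q_r | K_r | V_r] per
# attention-TP rank, with Q/K/V each pre-split 4 ways.
fused_shards = [
    np.concatenate([np.array_split(W, tp)[r] for W in (Q, K, V)])
    for r in range(tp)
]

# Loading at a different TP by contiguously regrouping the fused shards
# does NOT reproduce a valid shard: the Q/K/V blocks end up interleaved
# in the wrong order, so the weights would be silently misassigned.
naive_tp2_rank0 = np.concatenate(fused_shards[:2])
target_tp2_rank0 = np.concatenate([np.array_split(W, 2)[0] for W in (Q, K, V)])
print(np.array_equal(naive_tp2_rank0, target_tp2_rank0))  # False
```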
// TAGS
mimo-v2.5 · xiaomi · sglang · inference · tensor-parallelism · gpu · quantization · deployment
DISCOVERED
3h ago
2026-05-01
PUBLISHED
5h ago
2026-05-01
RELEVANCE
7/10
AUTHOR
Pyrenaeda