Mistral Medium 3.5 trails Qwen3.5 MoE
Tensor parallel roughly doubles Mistral Medium 3.5’s decode speed on 4x RTX 3080 20GB, but it still lands well behind Qwen3.5-122B-A10B in this setup. The takeaway is that, for local multi-GPU serving, a sparse MoE model looks far more practical than a dense 128B one.
The new llama.cpp tensor-parallel path makes Mistral Medium 3.5 usable for chat, but it does not change the underlying economics: you still burn a lot of VRAM for modest output speed.
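To put the VRAM side of that tradeoff in rough numbers, here is a back-of-the-envelope sketch of the weight footprint of a dense 128B model against the 4x 20GB budget. The bit widths are illustrative assumptions, not the quantization used in the benchmark, and the estimate covers weights only (KV cache and activations come on top).

```python
# Rough weight-only VRAM budget; KV cache and activations are extra.
GPU_COUNT = 4
VRAM_PER_GPU_GB = 20      # RTX 3080 20GB, from the benchmark setup
TOTAL_PARAMS_B = 128      # dense 128B model, as described in the summary


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB

# Assumed bit widths for illustration only; the post does not state
# which quantization was actually benchmarked.
for bits in (16, 8, 4):
    size_gb = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    headroom_gb = total_vram_gb - size_gb
    print(f"{bits:>2}-bit: weights ~{size_gb:.0f} GB, "
          f"headroom {headroom_gb:+.0f} GB of {total_vram_gb} GB")
```

Under these assumptions only the lowest bit width leaves any headroom on this rig, and whatever remains has to hold the KV cache.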
- Mistral’s tg128 improves from 10.37 tok/s with layer split to 21.59 tok/s with tensor parallel (about 2.1x), but prompt processing throughput drops, so the win is uneven
- Qwen3.5-122B-A10B decodes much faster in both configs here, around 53-60 tok/s, despite a similar total parameter count
- The MoE architecture matters: only a fraction of Qwen’s weights are active per token, which suits this kind of 4-GPU consumer setup
- llama.cpp tensor parallel does not appear to materially help Qwen’s generation speed in this benchmark, so the gain depends on both the runtime and the model architecture
- vLLM still looks like the stronger serving stack for Qwen in this hardware class, but the KV cache tradeoff becomes the real constraint
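The ratios behind these takeaways can be checked with a few lines of arithmetic. This is a sketch using only the tok/s figures quoted above; the 122B total and roughly 10B active parameter counts are read off the model name "122B-A10B" and should be treated as approximate assumptions.

```python
# Decode throughput figures quoted in the summary (tokens per second).
mistral_layer_split = 10.37
mistral_tensor_parallel = 21.59
qwen_decode_range = (53.0, 60.0)   # "around 53-60 tok/s"

# Speedup from switching Mistral from layer split to tensor parallel.
tp_speedup = mistral_tensor_parallel / mistral_layer_split

# How far Qwen stays ahead of Mistral's best (tensor-parallel) result.
qwen_vs_mistral = tuple(q / mistral_tensor_parallel for q in qwen_decode_range)

# Assumption from the model name: ~122B total, ~10B active per token.
qwen_total_b, qwen_active_b = 122, 10
active_fraction = qwen_active_b / qwen_total_b

print(f"Mistral TP speedup: {tp_speedup:.2f}x")
print(f"Qwen vs Mistral TP: {qwen_vs_mistral[0]:.2f}x to {qwen_vs_mistral[1]:.2f}x")
print(f"Qwen active weights per token: {active_fraction:.1%}")
```

The active fraction is not a speed prediction on its own; it only indicates how much less weight data the MoE model has to touch per generated token than a dense model of similar size.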
DISCOVERED: 2026-05-04
PUBLISHED: 2026-05-04
AUTHOR: lly0571