OPEN_SOURCE
REDDIT · 2h ago · BENCHMARK RESULT
Mistral Medium 3.5 trails Qwen3.5 MoE
Tensor parallel roughly doubles Mistral Medium 3.5’s decode speed on 4x RTX 3080 20GB, but it still lands well behind Qwen3.5-122B-A10B in this setup. The big picture is that sparse MoE looks far more practical than a dense 128B model for local multi-GPU serving.
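Why the MoE wins at decode falls out of a simple bandwidth argument: at batch size 1, decoding is memory-bound, so tokens per second scale with how many weight bytes must be streamed per token. A back-of-envelope sketch, where the bandwidth, efficiency, and quantization constants are illustrative assumptions rather than numbers from the post:

```python
# Roofline-style decode estimate: tok/s ~= usable memory bandwidth divided by
# weight bytes streamed per token. All constants are ASSUMPTIONS for
# illustration, not measurements from the benchmark thread.

GPU_BW_GBPS = 760        # RTX 3080 spec bandwidth in GB/s (assumed)
EFFICIENCY = 0.5         # fraction of peak achievable in practice (assumed)
BYTES_PER_PARAM = 0.56   # ~4.5-bit quantization (assumed)

def decode_tok_s(active_params_b: float, gpus_streaming: int) -> float:
    """Layer split keeps ~1 GPU busy per token; tensor parallel uses all 4."""
    usable_bw = GPU_BW_GBPS * 1e9 * EFFICIENCY * gpus_streaming
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM
    return usable_bw / bytes_per_token

print(f"dense 128B, layer split:      {decode_tok_s(128, 1):6.1f} tok/s")
print(f"dense 128B, tensor parallel:  {decode_tok_s(128, 4):6.1f} tok/s")
print(f"MoE ~10B active, layer split: {decode_tok_s(10, 1):6.1f} tok/s")
```

Even with crude constants the ordering matches the post: tensor parallel raises the dense model's ceiling by engaging all four GPUs per token, while the MoE's ~10B active weights put it in a different class before any parallelism helps. Real numbers land lower than the estimates due to routing, scheduling, and interconnect overhead.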
// ANALYSIS
The new llama.cpp tensor-parallel path makes Mistral Medium 3.5 usable for chat, but it does not change the underlying economics: you still burn a lot of VRAM for modest output speed.
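An A/B like the one in the post can be scripted against llama-bench. Note the assumptions: the model path is a placeholder, and whether the new tensor-parallel path is exposed through the long-standing `--split-mode row` option or a newer flag is not confirmed here.

```python
# Sketch: compare layer split vs row (tensor-style) split with llama-bench.
# ASSUMPTION: the new tensor-parallel path maps onto "--split-mode row";
# the GGUF path below is a placeholder.
import subprocess

MODEL = "mistral-medium-3.5-q4.gguf"  # placeholder path

for mode in ("layer", "row"):
    subprocess.run(
        [
            "llama-bench",
            "-m", MODEL,
            "-sm", mode,    # --split-mode: how weights spread across the 4 GPUs
            "-ngl", "999",  # offload all layers to GPU
            "-p", "512",    # prompt-processing test (pp512)
            "-n", "128",    # token-generation test (tg128)
        ],
        check=True,
    )
```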
- Mistral’s tg128 improves from 10.37 tok/s with layer split to 21.59 tok/s with tensor parallel, but prompt processing drops, so the win is uneven
- Qwen3.5-122B-A10B is dramatically faster at decode in both configs here, around 53-60 tok/s, despite a similar total parameter count
- The MoE architecture matters: only a fraction of Qwen’s weights are active per token, which is exactly what this kind of 4-GPU consumer setup wants
- llama.cpp tensor parallel does not appear to materially help Qwen’s generation speed in this benchmark, so runtime choice and model architecture both matter
- vLLM still looks like the stronger serving stack for Qwen in this hardware class, but the KV cache tradeoff becomes the real constraint (see the sketch below)
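On the vLLM point, the KV-cache constraint is a budgeting exercise: once the weights are sharded across the four cards, whatever VRAM remains per GPU becomes cache, and context length is the main lever. A minimal sketch using vLLM's offline API; the repo id is a placeholder, and fitting a ~122B checkpoint into 4x 20GB almost certainly presumes a quantized build:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-122B-A10B",  # placeholder repo id (assumption)
    tensor_parallel_size=4,          # shard weights across the 4x RTX 3080s
    gpu_memory_utilization=0.92,     # budget for weights + KV cache per GPU
    max_model_len=8192,              # cap context so the KV cache fits
    # quantization="awq",            # some quantized format is likely required
)

out = llm.generate(
    ["Why do sparse MoE models decode faster than dense ones?"],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```

Lowering max_model_len directly buys back KV-cache headroom, which is the tradeoff the benchmark thread flags.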
// TAGS
llm · open-weights · moe · quantization · benchmark · gpu · mistral-medium-3-5 · qwen3-5-122b-a10b
DISCOVERED
2h ago
2026-05-04
PUBLISHED
6h ago
2026-05-04
RELEVANCE
9/10
AUTHOR
lly0571