REDDIT // 2h ago · BENCHMARK RESULT

Mistral Medium 3.5 trails Qwen3.5 MoE

Tensor parallel roughly doubles Mistral Medium 3.5’s decode speed on 4x RTX 3080 20GB, but it still lands well behind Qwen3.5-122B-A10B in this setup. The takeaway is that, for local multi-GPU serving, a sparse MoE looks far more practical than a dense 128B model.

// ANALYSIS

The new llama.cpp tensor-parallel path makes Mistral Medium 3.5 usable for chat, but it does not change the underlying economics: you still burn a lot of VRAM for modest output speed.

– Mistral’s tg128 improves from 10.37 tok/s with layer split to 21.59 tok/s with tensor parallel, but prompt processing drops, so the win is uneven
– Qwen3.5-122B-A10B is dramatically faster at decode in both configs here, around 53-60 tok/s, despite a similar total parameter count
– The MoE architecture matters: only a fraction of Qwen’s weights are active per token, which is exactly what this kind of 4-GPU consumer setup wants
– llama.cpp tensor parallel does not appear to materially help Qwen’s generation speed in this benchmark, so runtime choice and model architecture both matter
– vLLM still looks like the stronger serving stack for Qwen in this hardware class, but the KV cache tradeoff becomes the real constraint
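
To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the decode figures quoted in the bullets. The parameter counts are assumptions read off the model names and the summary (128B dense for Mistral, 122B total with roughly 10B active for Qwen), and the variable names are purely illustrative. The weight-read ratio is a crude proxy, not a throughput prediction: it ignores attention, KV cache traffic, expert routing, quantization differences, and inter-GPU communication.

```python
# Decode throughput quoted in the summary above (tokens per second).
mistral_layer_split = 10.37
mistral_tensor_parallel = 21.59
qwen_decode_range = (53.0, 60.0)  # "around 53-60 tok/s"


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


# Gain from switching Mistral from layer split to tensor parallel.
tp_gain = speedup(mistral_tensor_parallel, mistral_layer_split)

# Qwen's decode lead over Mistral's best (tensor-parallel) result.
qwen_vs_mistral = tuple(
    speedup(q, mistral_tensor_parallel) for q in qwen_decode_range
)

# ASSUMPTION: parameter counts in billions, taken from the model names
# and the summary text, not from measured memory traffic.
dense_params_b = 128.0   # dense model: every weight is used per token
moe_total_b = 122.0      # MoE model: total weights held in VRAM
moe_active_b = 10.0      # MoE model: weights used per token ("A10B")

active_fraction = moe_active_b / moe_total_b
weights_read_ratio = dense_params_b / moe_active_b

print(f"Mistral tensor-parallel gain: {tp_gain:.2f}x")
print(f"Qwen lead over Mistral TP: "
      f"{qwen_vs_mistral[0]:.2f}x to {qwen_vs_mistral[1]:.2f}x")
print(f"Qwen active weight fraction: {active_fraction:.1%}")
print(f"Dense vs active weights per token: {weights_read_ratio:.1f}x")
```

Under these assumptions the dense model touches far more weights per token than the measured decode gap alone would suggest, which is consistent with the point that other costs besides weight reads shape the final tok/s on this hardware.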

// TAGS

llm · open-weights · moe · quantization · benchmark · gpu · mistral-medium-3-5 · qwen3-5-122b-a10b

DISCOVERED

2h ago

2026-05-04

PUBLISHED

6h ago

2026-05-04

RELEVANCE

9/10

AUTHOR

lly0571