Mistral Medium 3.5 trails Qwen3.5 MoE

// 45d ago · BENCHMARK RESULT

Tensor parallelism roughly doubles Mistral Medium 3.5’s decode speed on 4x RTX 3080 20GB, but the model still lands well behind Qwen3.5-122B-A10B in this setup. The takeaway is that a sparse MoE looks far more practical than a dense 128B model for local multi-GPU serving.

// ANALYSIS

The new llama.cpp tensor-parallel path makes Mistral Medium 3.5 usable for chat, but it does not change the underlying economics: you still burn a lot of VRAM for modest output speed.
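
To make the VRAM side of that tradeoff concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only (no KV cache, no runtime buffers), and the bits-per-weight values are hypothetical: the summary does not state which quantization the benchmark used. The function name `weight_footprint_gib` is likewise just an illustrative helper.

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring KV cache and runtime overhead."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# 4x RTX 3080 20GB from the benchmark setup, treated as a nominal 80 GiB budget.
TOTAL_VRAM_GIB = 4 * 20

# Hypothetical quantization levels (assumption, not taken from the source).
for bits in (4.0, 4.5, 5.0):
    dense = weight_footprint_gib(128, bits)
    share = dense / TOTAL_VRAM_GIB
    print(f"{bits} bits/weight: dense 128B needs ~{dense:.1f} GiB ({share:.0%} of the budget)")
```

Under these assumptions the weights alone take most of the available memory, which leaves little room for KV cache regardless of how fast decode is.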

  • Mistral’s tg128 improves from 10.37 tok/s with layer split to 21.59 tok/s with tensor parallel, but prompt processing drops, so the win is uneven
  • Qwen3.5-122B-A10B is dramatically faster at decode in both configs here, around 53-60 tok/s, despite similar total parameter count
  • The MoE architecture matters: only about 10B of Qwen’s 122B parameters are active per token (the A10B in the model name), which is exactly what this kind of 4-GPU consumer setup wants
  • llama.cpp tensor parallel does not appear to materially help Qwen’s generation speed in this benchmark, so runtime choice and model architecture both matter
  • vLLM still looks like the stronger serving stack for Qwen in this hardware class, but the KV cache tradeoff becomes the real constraint
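
The ratios behind the bullets above can be checked with a few lines of arithmetic. The throughput figures are the ones quoted in this summary. The 10B active-parameter figure is read from the "A10B" suffix in the model name, and treating weights read per token as a proxy for decode cost is a simplifying assumption, not something the benchmark measured.

```python
# Decode throughput (tok/s) as quoted in the summary above.
mistral_layer_split = 10.37
mistral_tensor_parallel = 21.59
qwen_decode_range = (53.0, 60.0)  # "around 53-60 tok/s"

# Speedup from switching Mistral to tensor parallel.
tp_speedup = mistral_tensor_parallel / mistral_layer_split

# How far Qwen stays ahead of Mistral's best (tensor-parallel) result.
gap_low = qwen_decode_range[0] / mistral_tensor_parallel
gap_high = qwen_decode_range[1] / mistral_tensor_parallel

# Naive ratio of weights touched per token: dense 128B vs ~10B active (assumption).
active_ratio = 128 / 10

print(f"tensor-parallel speedup for Mistral: {tp_speedup:.2f}x")
print(f"Qwen lead over Mistral (tensor parallel): {gap_low:.2f}x to {gap_high:.2f}x")
print(f"naive active-weight ratio: {active_ratio:.1f}x")
```

The measured lead is much smaller than the naive active-weight ratio, which fits the point that runtime overheads matter alongside architecture.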

// TAGS
llm, open-weights, moe, quantization, benchmark, gpu, mistral-medium-3-5, qwen3-5-122b-a10b

DISCOVERED: 2026-05-04 (45d ago)
PUBLISHED: 2026-05-04 (45d ago)
RELEVANCE: 9/10
AUTHOR: lly0571