OPEN_SOURCE
REDDIT · 18h ago · INFRASTRUCTURE

Mistral Medium 3.5 strains 4x3090 rigs

The post says Mistral Medium 3.5 only manages about 11 tokens/sec on a 4x RTX 3090 setup in llama.cpp, even with everything kept on GPU, and asks whether vLLM can run a quantized 128B model with a usable context window without blowing up VRAM. The real issue is the speed-versus-memory curve: on this class of model, context size and KV cache matter almost as much as weight quantization.
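To make that budget concrete, here is a rough back-of-envelope sketch in Python. The layer count, KV-head count, head dimension, and the ~4.5 bits/weight figure for a Q4-class quant are assumptions for illustration; the post does not state the model's actual architecture, so treat the numbers as a sizing exercise, not a spec.

```python
# Rough VRAM budget sketch for a dense ~128B model on 4x RTX 3090 (96 GiB total).
# All architecture numbers below are assumed for illustration; the real layer
# count, KV heads, and head dim of Mistral Medium 3.5 may differ.

GiB = 1024**3

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint at a given quantization level."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: K and V tensors, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Hypothetical architecture for a 128B dense model (assumed, not confirmed).
N_PARAMS   = 128e9
N_LAYERS   = 88
N_KV_HEADS = 8      # assumes grouped-query attention
HEAD_DIM   = 128

total_vram = 4 * 24 * GiB   # 4x RTX 3090

for ctx in (8_192, 32_768, 65_536):
    w  = weight_bytes(N_PARAMS, bits_per_weight=4.5)   # Q4-class quant incl. overhead
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx)
    print(f"ctx={ctx:>6}: weights {w/GiB:5.1f} GiB + KV {kv/GiB:5.1f} GiB "
          f"= {(w + kv)/GiB:5.1f} GiB of {total_vram/GiB:.0f} GiB")
```

Under these assumptions the quantized weights alone take roughly two thirds of the 96 GiB pool, which is why the remaining headroom for KV cache, activations, and framework overhead shrinks quickly as the context window grows.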

// ANALYSIS

vLLM is worth testing, but it is not a free speed win. Its distributed tensor parallelism and quantization support can lift throughput, yet a 128B dense model plus a large context window can still hit VRAM limits fast on 24GB cards.

  • vLLM supports multi-GPU tensor parallel serving and multiple quantization paths, including GGUF, AWQ, GPTQ, INT4/INT8, and FP8, so the setup is plausible in principle.
  • The likely tradeoff is aggregate throughput versus memory efficiency: vLLM tends to shine with batching and concurrent requests, while llama.cpp is often leaner for a single-stream, memory-tight setup.
  • For a model this large, KV cache is the hidden cost. A "decent" context can erase most of the gains from aggressive weight quantization.
  • The earlier Qwen 3.5 27B experience is the right intuition check: vLLM can be materially faster, but it usually pays for that with higher VRAM use.
  • The practical way to predict the tradeoff is to size the quantized weights against the 4-way tensor-parallel layout, then check how much GPU KV-cache capacity vLLM reports at the context length you actually need; that number is the closest useful proxy for speed and memory headroom on this hardware (see the sketch after this list).
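
A minimal vLLM sketch along those lines, assuming AWQ-format weights at a hypothetical local path. The quantization argument, context cap, and memory-utilization value are placeholders to adjust for the checkpoint and headroom you actually have; this is a probe of the memory/context tradeoff, not a tuned serving config.

```python
# Sketch of a vLLM setup to probe the memory/context tradeoff on 4x3090.
# Model path and quantization choice are placeholders; match them to the
# format of the weights you actually have (AWQ/GPTQ/GGUF) before running.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/mistral-medium-3.5-awq",  # hypothetical local path
    tensor_parallel_size=4,                  # one shard per 3090
    quantization="awq",                      # match the checkpoint's quant format
    max_model_len=16384,                     # cap context to protect KV-cache headroom
    gpu_memory_utilization=0.92,             # leave a little slack per card
    kv_cache_dtype="fp8",                    # optional: shrink the KV cache further
)

# At startup vLLM logs the GPU KV-cache capacity it could allocate at this
# max_model_len; that figure is the practical check on whether the context
# window fits before you ever measure tokens/sec.
out = llm.generate(
    ["Explain the KV cache tradeoff in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```

If the reported KV-cache capacity at the target context length is already thin, dropping `max_model_len` or switching to an FP8 KV cache usually buys more usable room than squeezing the weight quantization further.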
// TAGS
llm · inference · quantization · long-context · open-weights · vllm · llama-cpp · mistral-medium-3-5

DISCOVERED
18h ago · 2026-05-02

PUBLISHED
20h ago · 2026-05-02

RELEVANCE
8/10

AUTHOR

Septerium