OPEN_SOURCE
REDDIT · 18h ago · INFRASTRUCTURE
Mistral Medium 3.5 strains 4x3090 rigs
The post reports that Mistral Medium 3.5 manages only about 11 tokens/sec in llama.cpp on a 4x RTX 3090 rig, even with everything kept on GPU, and asks whether vLLM can run a quantized 128B model with a usable context window without blowing past VRAM. The real issue is the speed-versus-memory curve: at this model size, context length and KV cache matter almost as much as weight quantization.
// ANALYSIS
vLLM is worth testing, but it is not a free speed win. Its tensor-parallel serving and quantization support can lift throughput, yet a 128B dense model plus a large context window can still exhaust VRAM fast on 24GB cards; a minimal launch sketch is below.
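As a concreteness check, here is a minimal sketch of that setup, assuming a 4-way tensor-parallel launch and an AWQ-quantized checkpoint. The model ID is a placeholder, not a real published build, and the two memory knobs shown are the ones that most directly trade context for VRAM headroom.

```python
# Hedged sketch: 4-way tensor-parallel vLLM with a quantized ~128B checkpoint.
# The model ID below is a PLACEHOLDER; substitute whatever AWQ/GPTQ build you have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/mistral-medium-3.5-awq",  # hypothetical quantized checkpoint
    tensor_parallel_size=4,                  # shard weights across the 4x RTX 3090s
    quantization="awq",                      # must match the checkpoint's format
    max_model_len=16384,                     # cap context to bound per-request KV cache
    gpu_memory_utilization=0.90,             # leave headroom for activations
)

out = llm.generate(
    ["Explain KV-cache sizing in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```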
- vLLM supports multi-GPU tensor-parallel serving and several quantization paths, including GGUF, AWQ, GPTQ, INT4/INT8, and FP8, so the setup is plausible in principle.
- The likely tradeoff is aggregate throughput versus memory efficiency: vLLM shines with batching and concurrent requests, while llama.cpp is often leaner for a single-stream, memory-tight setup.
- For a model this large, the KV cache is the hidden cost. A "decent" context can erase most of the gains from aggressive weight quantization.
- The earlier Qwen 3.5 27B experience is the right intuition check: vLLM can be materially faster, but it usually pays for that with higher VRAM use.
- The practical way to predict the tradeoff is to size the model around the target tensor-parallel layout, then inspect the GPU blocks vLLM reports available at the context length you actually need; that is the closest useful proxy for speed and memory on this hardware (see the sizing sketch after this list).
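To see why the KV cache dominates, a back-of-envelope sizing sketch helps. All architecture numbers below are assumptions, since the post gives no config for Mistral Medium 3.5; read the real layer count, KV-head count, and head dimension from the checkpoint's config.json.

```python
# Hedged back-of-envelope KV-cache sizing. Every architecture number below is
# ASSUMED for illustration; pull the real values from the model's config.json.
n_layers = 80        # assumed transformer depth for a ~128B dense model
n_kv_heads = 8       # assumed GQA key/value head count
head_dim = 128       # assumed per-head dimension
bytes_per_elem = 2   # fp16/bf16 cache (1 if the engine quantizes KV to 8-bit)

# Per token: keys + values, across all KV heads and all layers.
kv_bytes_per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers

for ctx in (4096, 16384, 32768, 65536):
    gib = ctx * kv_bytes_per_token / 2**30
    print(f"{ctx:>6} tokens -> {gib:5.1f} GiB of KV cache")
```

With these assumed numbers, a 32k context costs roughly 10 GiB of cache on top of the quantized weights, sharded across the four cards. vLLM reports at startup how many KV-cache blocks (16 tokens each by default) fit after the weights load; if the requested max_model_len needs more blocks than that, the launch fails before you ever measure tokens/sec, which makes the startup report the quickest feasibility test.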
// TAGS
llm-inference · quantization · long-context · open-weights · vllm · llama-cpp · mistral-medium-3-5
DISCOVERED
18h ago
2026-05-02
PUBLISHED
20h ago
2026-05-02
RELEVANCE
8/10
AUTHOR
Septerium