OPEN_SOURCE
REDDIT · 18h ago · INFRASTRUCTURE
Mistral Medium 3.5 strains 4x3090 rigs
The post reports that Mistral Medium 3.5 manages only about 11 tokens/sec in llama.cpp on a 4x RTX 3090 rig, even with everything kept on GPU, and asks whether vLLM can run a quantized 128B model with a usable context window without blowing past VRAM. The real issue is the speed-versus-memory curve: at this model size, context length and KV cache matter almost as much as weight quantization.
// ANALYSIS
vLLM is worth testing, but it is not a free speed win. Its tensor-parallel serving and quantization support can lift throughput, yet a 128B dense model plus a large context window can still exhaust VRAM fast on 24GB cards; a minimal launch sketch is below.
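As a concreteness check, here is a minimal sketch of that setup, assuming a 4-way tensor-parallel launch and an AWQ-quantized checkpoint. The model ID is a placeholder, not a real published build, and the two memory knobs shown are the ones that most directly trade context for VRAM headroom.

```python
# Hedged sketch: 4-way tensor-parallel vLLM with a quantized ~128B checkpoint.
# The model ID below is a PLACEHOLDER; substitute whatever AWQ/GPTQ build you have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/mistral-medium-3.5-awq",  # hypothetical quantized checkpoint
    tensor_parallel_size=4,                  # shard weights across the 4x RTX 3090s
    quantization="awq",                      # must match the checkpoint's format
    max_model_len=16384,                     # cap context to bound per-request KV cache
    gpu_memory_utilization=0.90,             # leave headroom for activations
)

out = llm.generate(
    ["Explain KV-cache sizing in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```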
- vLLM supports multi-GPU tensor-parallel serving and several quantization paths, including GGUF, AWQ, GPTQ, INT4/INT8, and FP8, so the setup is plausible in principle.
- The likely tradeoff is aggregate throughput versus memory efficiency: vLLM shines with batching and concurrent requests, while llama.cpp is often leaner for a single-stream, memory-tight setup.
- For a model this large, the KV cache is the hidden cost. A "decent" context can erase most of the gains from aggressive weight quantization.
- The earlier Qwen 3.5 27B experience is the right intuition check: vLLM can be materially faster, but it usually pays for that with higher VRAM use.
- The practical way to predict the tradeoff is to size the model around the target tensor-parallel layout, then inspect the GPU blocks vLLM reports available at the context length you actually need; that is the closest useful proxy for speed and memory on this hardware (see the sizing sketch after this list).
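To see why the KV cache dominates, a back-of-envelope sizing sketch helps. All architecture numbers below are assumptions, since the post gives no config for Mistral Medium 3.5; read the real layer count, KV-head count, and head dimension from the checkpoint's config.json.

```python
# Hedged back-of-envelope KV-cache sizing. Every architecture number below is
# ASSUMED for illustration; pull the real values from the model's config.json.
n_layers = 80        # assumed transformer depth for a ~128B dense model
n_kv_heads = 8       # assumed GQA key/value head count
head_dim = 128       # assumed per-head dimension
bytes_per_elem = 2   # fp16/bf16 cache (1 if the engine quantizes KV to 8-bit)

# Per token: keys + values, across all KV heads and all layers.
kv_bytes_per_token = 2 * n_kv_heads * head_dim * bytes_per_elem * n_layers

for ctx in (4096, 16384, 32768, 65536):
    gib = ctx * kv_bytes_per_token / 2**30
    print(f"{ctx:>6} tokens -> {gib:5.1f} GiB of KV cache")
```

With these assumed numbers, a 32k context costs roughly 10 GiB of cache on top of the quantized weights, sharded across the four cards. vLLM reports at startup how many KV-cache blocks (16 tokens each by default) fit after the weights load; if the requested max_model_len needs more blocks than that, the launch fails before you ever measure tokens/sec, which makes the startup report the quickest feasibility test.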
// TAGS
llm-inference · quantization · long-context · open-weights · vllm · llama-cpp · mistral-medium-3-5
DISCOVERED
18h ago
2026-05-02
PUBLISHED
20h ago
2026-05-02
RELEVANCE
8/10
AUTHOR
Septerium