OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Mistral Medium 3.5 benchmarked on 3x3090
A Reddit user shares real-world local inference speeds for Mistral Medium 3.5 Q3_K_M on three RTX 3090s, with examples across Python, SVG, and HTML generation. It's a practical data point for anyone judging whether a 128B open-weight model is usable on high-end consumer hardware.
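For anyone wanting to reproduce this kind of number on their own stack, here is a minimal throughput probe against a local OpenAI-compatible endpoint (a sketch only: the URL, model id, and prompts are placeholder assumptions, not taken from the post, and the timing lumps prompt processing and decoding together):

```python
# Rough tokens/sec probe against a local OpenAI-compatible server
# (e.g. llama.cpp's llama-server). URL and model id are placeholders.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

def tokens_per_second(prompt: str, max_tokens: int = 512) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={
        "model": "mistral-medium-3.5-q3_k_m",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    # End-to-end rate: includes prompt processing, not pure decode speed.
    return resp.json()["usage"]["completion_tokens"] / elapsed

# Mirror the post's prompt mix: code, SVG, and HTML generation.
for task in ("Write a Python quicksort.",
             "Draw a smiley face as inline SVG.",
             "Write a minimal HTML landing page."):
    print(f"{task:40s} -> {tokens_per_second(task):.1f} tok/s")
```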
// ANALYSIS
This is less a launch story than a deployment reality check: Mistral says Medium 3.5 can self-host on as few as four GPUs, and this post shows what that looks like in the wild.
- Q3 quantization is doing the heavy lifting here; without it, a 128B model wouldn't fit in 72GB of VRAM at all (FP16 weights alone run ~256GB). See the arithmetic sketch after this list.
- The varied prompt types matter: code, SVG, and HTML generation stress the model and runtime differently, so the screenshots are more informative than a single synthetic tokens/sec number.
- For local AI builders, this is the core tradeoff: you get an open-weight, near-frontier model, but you still pay in GPU count, power, and tuning effort.
- If output quality holds up at this speed, Medium 3.5 becomes interesting for self-hosted coding agents, not just chat demos.
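As a rough check on the quantization point above, here is back-of-the-envelope weight-memory arithmetic (a sketch; the bits-per-weight figures are approximate averages for llama.cpp quant formats, and real usage adds KV cache, activations, and runtime overhead on top):

```python
# Approximate weight memory for a 128B-parameter model at different
# precisions. Ignores KV cache, activations, and runtime overhead.
PARAMS = 128e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

# bpw values are approximate averages for llama.cpp quant formats.
for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5),
                  ("Q4_K_M", 4.85), ("Q3_K_M", 3.9)]:
    verdict = "fits" if weight_gb(bpw) <= 72 else "does not fit"
    print(f"{name:7s} ~{bpw:5.2f} bpw -> {weight_gb(bpw):6.1f} GB "
          f"({verdict} in 72 GB)")
```

Only the Q3 variant leaves meaningful headroom within 3x24GB, which is consistent with the poster's choice of Q3_K_M over a 4-bit quant.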
// TAGS
mistral-medium-3-5 · llm · open-weights · quantization · inference · gpu · self-hosted · benchmark
DISCOVERED
2026-05-04 (3h ago)
PUBLISHED
2026-05-04 (6h ago)
RELEVANCE
8/10
AUTHOR
jacek2023