OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
Mistral Medium 3.5 Crawls on Strix Halo
A LocalLLaMA user benchmarked Mistral Medium 3.5 on AMD Strix Halo and found it brutally slow: a 48k-token prompt plus about 4k thinking tokens took roughly two hours. Prompt evaluation ran at 9.76 tokens/sec, but generation fell to 2.10 tokens/sec, making this an overnight-only setup for long reasoning jobs.
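The two-hour figure follows directly from the reported throughput. A quick back-of-the-envelope check in Python, using only the numbers from the post:

```python
# Wall-time estimate from the reported benchmark numbers.
prompt_tokens = 48_000        # prompt size from the post
gen_tokens = 4_000            # approximate thinking/output tokens
prompt_eval_tps = 9.76        # reported prompt-evaluation speed (tok/s)
gen_tps = 2.10                # reported generation speed (tok/s)

prompt_s = prompt_tokens / prompt_eval_tps   # ~4918 s
gen_s = gen_tokens / gen_tps                 # ~1905 s
total_min = (prompt_s + gen_s) / 60

print(f"prompt eval: {prompt_s/60:.0f} min, generation: {gen_s/60:.0f} min, "
      f"total: {total_min:.0f} min (~{total_min/60:.1f} h)")
# -> prompt eval: 82 min, generation: 32 min, total: 114 min (~1.9 h)
```

Notably, prompt evaluation alone accounts for over 80 minutes of the run; generation adds another half hour on top.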
// ANALYSIS
This is a useful reality check: a 128B dense open-weight model can be impressive and still feel unusable once you push it onto consumer-class hardware with a huge context load.
- The bottleneck is not just model size but the combination of 128B dense weights, a long context, and local inference overhead
- Strix Halo’s unified memory makes running the model possible at all, but not fast enough for interactive codebase work
- The numbers suggest Mistral Medium 3.5 is better suited to batch jobs, offline analysis, and queued agent runs than to live back-and-forth prompting (see the sketch after this list)
- For developers, the practical takeaway is that “self-hostable” and “pleasant to use” remain very different bars
- The post also reinforces that quantization and server flags offer only marginal gains when the underlying model is a 128B dense architecture
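For the queued-runs pattern above, a minimal sketch of an overnight batch driver. It assumes a local OpenAI-compatible server (such as llama.cpp’s llama-server) listening on localhost:8080; the endpoint URL, model id, job list, and output file are illustrative assumptions, not details from the post:

```python
# Minimal overnight batch runner: queue prompts against a local
# OpenAI-compatible endpoint and write results to disk instead of
# waiting interactively. URL, model id, and jobs are assumptions.
import json
import time
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server
MODEL = "mistral-medium-3.5"                            # assumed model id

jobs = [
    "Summarize the attached design doc.",
    "Review this module for concurrency bugs.",
]  # in practice, load these from files or a task queue

with open("batch_results.jsonl", "a") as out:
    for i, prompt in enumerate(jobs):
        start = time.time()
        resp = requests.post(
            ENDPOINT,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=4 * 60 * 60,  # allow multi-hour generations
        )
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        out.write(json.dumps({
            "job": i,
            "seconds": round(time.time() - start, 1),
            "answer": answer,
        }) + "\n")
        out.flush()  # keep partial results if the run is interrupted
```

At roughly two hours per long-context job, a script like this is the realistic way to use the model on this hardware: queue work in the evening, collect results in the morning.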
// TAGS
mistral-medium-3-5 · llm · benchmark · inference · long-context · quantization · gpu · self-hosted
DISCOVERED
4h ago
2026-05-03
PUBLISHED
5h ago
2026-05-03
RELEVANCE
9/10
AUTHOR
Zc5Gwu