OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
Mistral Medium 3.5 Crawls on Strix Halo
A LocalLLaMA user benchmarked Mistral Medium 3.5 on AMD Strix Halo and found it brutally slow: a 48k-token prompt plus about 4k thinking tokens took roughly two hours. Prompt evaluation ran at 9.76 tokens/sec, but generation fell to 2.10 tokens/sec, making this an overnight-only setup for long reasoning jobs.
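The two-hour figure follows directly from the reported throughput. A quick back-of-the-envelope check in Python, using only the numbers from the post:

```python
# Wall-time estimate from the reported benchmark numbers.
prompt_tokens = 48_000        # prompt size from the post
gen_tokens = 4_000            # approximate thinking/output tokens
prompt_eval_tps = 9.76        # reported prompt-evaluation speed (tok/s)
gen_tps = 2.10                # reported generation speed (tok/s)

prompt_s = prompt_tokens / prompt_eval_tps   # ~4918 s
gen_s = gen_tokens / gen_tps                 # ~1905 s
total_min = (prompt_s + gen_s) / 60

print(f"prompt eval: {prompt_s/60:.0f} min, generation: {gen_s/60:.0f} min, "
      f"total: {total_min:.0f} min (~{total_min/60:.1f} h)")
# -> prompt eval: 82 min, generation: 32 min, total: 114 min (~1.9 h)
```

Notably, prompt evaluation alone accounts for over 80 minutes of the run; generation adds another half hour on top.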
// ANALYSIS
This is a useful reality check: a 128B dense open-weight model can be impressive and still feel unusable once you push it onto consumer-class hardware with a huge context load.
- The bottleneck is not just model size but the combination of 128B dense weights, a long context, and local inference overhead
- Strix Halo’s unified memory makes running the model possible at all, but not fast enough for interactive codebase work
- The numbers suggest Mistral Medium 3.5 is better suited to batch jobs, offline analysis, and queued agent runs than to live back-and-forth prompting (see the sketch after this list)
- For developers, the practical takeaway is that “self-hostable” and “pleasant to use” remain very different bars
- The post also reinforces that quantization and server flags offer only marginal gains when the underlying model is a 128B dense architecture
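For the queued-runs pattern above, a minimal sketch of an overnight batch driver. It assumes a local OpenAI-compatible server (such as llama.cpp’s llama-server) listening on localhost:8080; the endpoint URL, model id, job list, and output file are illustrative assumptions, not details from the post:

```python
# Minimal overnight batch runner: queue prompts against a local
# OpenAI-compatible endpoint and write results to disk instead of
# waiting interactively. URL, model id, and jobs are assumptions.
import json
import time
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server
MODEL = "mistral-medium-3.5"                            # assumed model id

jobs = [
    "Summarize the attached design doc.",
    "Review this module for concurrency bugs.",
]  # in practice, load these from files or a task queue

with open("batch_results.jsonl", "a") as out:
    for i, prompt in enumerate(jobs):
        start = time.time()
        resp = requests.post(
            ENDPOINT,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=4 * 60 * 60,  # allow multi-hour generations
        )
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        out.write(json.dumps({
            "job": i,
            "seconds": round(time.time() - start, 1),
            "answer": answer,
        }) + "\n")
        out.flush()  # keep partial results if the run is interrupted
```

At roughly two hours per long-context job, a script like this is the realistic way to use the model on this hardware: queue work in the evening, collect results in the morning.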
// TAGS
mistral-medium-3-5 · llm · benchmark · inference · long-context · quantization · gpu · self-hosted
DISCOVERED
4h ago
2026-05-03
PUBLISHED
5h ago
2026-05-03
RELEVANCE
9/10
AUTHOR
Zc5Gwu