Mistral Medium 3.5 Crawls on Strix Halo
OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT
A LocalLLaMA user benchmarked Mistral Medium 3.5 on AMD Strix Halo and found it brutally slow: a 48k-token prompt plus about 4k thinking tokens took roughly two hours. Prompt evaluation ran at 9.76 tokens/sec, but generation fell to 2.10 tokens/sec, making this an overnight-only setup for long reasoning jobs.
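The two-hour figure is consistent with the reported rates; a quick sanity check, assuming the full 48k prompt is processed at the prompt-eval rate and all ~4k thinking tokens at the generation rate (a simplification that ignores sampling and scheduling overhead):

```python
# Back-of-envelope check of the reported runtime from the post's numbers.
prompt_tokens = 48_000      # reported prompt size
gen_tokens = 4_000          # approximate thinking tokens
prompt_tps = 9.76           # prompt evaluation, tokens/sec
gen_tps = 2.10              # generation, tokens/sec

prompt_s = prompt_tokens / prompt_tps   # ~4918 s, about 82 minutes
gen_s = gen_tokens / gen_tps            # ~1905 s, about 32 minutes
total_min = (prompt_s + gen_s) / 60
print(f"prompt: {prompt_s/60:.0f} min, generation: {gen_s/60:.0f} min, "
      f"total: {total_min:.0f} min")
```

That lands at roughly 114 minutes, matching the "roughly two hours" claim.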

// ANALYSIS

This is a useful reality check: a 128B dense open-weight model can be impressive and still feel unusable once you push it onto consumer-class hardware with a huge context load.

  • The bottleneck is not just the model size, but the combination of 128B dense weights, long context, and local inference overhead
  • Strix Halo’s unified memory makes this possible, but not fast enough for interactive codebase work
  • The numbers suggest Mistral Medium 3.5 is better suited to batch jobs, offline analysis, and queued agent runs than live back-and-forth prompting
  • For developers, the practical takeaway is that “self-hostable” and “pleasant to use” are still very different bars
  • The post also reinforces that quantization and server flags can only trim the margins at this scale; raw model size dominates the speed
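The generation speed is close to what a memory-bandwidth-bound estimate predicts. A minimal sketch, assuming ~256 GB/s of usable memory bandwidth on Strix Halo and a ~4-bit quantization of the 128B dense weights (both are assumptions, not stated in the post):

```python
# Rough decode-speed ceiling for a dense model: generating each token
# requires streaming all active weights from memory at least once.
params = 128e9              # 128B dense parameters (from the post)
bytes_per_param = 0.5       # ~4-bit quantization (assumed)
bandwidth = 256e9           # Strix Halo LPDDR5X bandwidth, bytes/sec (approximate)

weight_bytes = params * bytes_per_param    # ~64 GB of weights
ceiling_tps = bandwidth / weight_bytes     # theoretical upper bound, tokens/sec
print(f"bandwidth-bound ceiling: {ceiling_tps:.1f} tok/s")
```

The ceiling comes out around 4 tok/s; the observed 2.10 tok/s is about half that, which is typical once real-world cache, KV-cache, and kernel overheads are included. The point stands: no server flag can push a dense 128B model past what the memory bus allows.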
// TAGS
mistral-medium-3-5 · llm · benchmark · inference · long-context · quantization · gpu · self-hosted

DISCOVERED

4h ago

2026-05-03

PUBLISHED

5h ago

2026-05-03

RELEVANCE

9/10

AUTHOR

Zc5Gwu