OPEN_SOURCE
REDDIT // 6h ago // BENCHMARK RESULT
Mistral Medium 3.5 Posts Strong TBLite Score
A Reddit user benchmarked Mistral Medium 3.5 on TBLite, a lighter proxy for TerminalBench 2.0, and says it performs well for its size. The result is unofficial and single-run, but it adds to the case that Mistral’s new 128B model is materially better at agentic coding than prior Mistral releases.
// ANALYSIS
This is a useful directional signal rather than a verdict, but it is the kind of signal that matters to people choosing models for terminal agents and tool-heavy workflows. The bigger story is that Mistral appears to have closed much of the gap on practical agentic behavior, not just on chat or coding benchmarks.
- TBLite is not TerminalBench 2.0, so the number should be treated as a trend indicator rather than a direct substitute
- A single run can swing meaningfully, especially on agent benchmarks with tool-use variance
- The result matters because Medium 3.5 is open-weight and self-hostable, so strong agentic performance has deployment value beyond API users
- It fits Mistral’s own positioning: merged instruction, reasoning, and coding in one 128B dense model
- Compared with the post’s reference point, it suggests a clear step up from earlier Mistral models in terminal/tool reliability
// TAGS
mistral-medium-3.5 · benchmark · agent · ai-coding · llm · open-weights
DISCOVERED
6h ago
2026-04-30
PUBLISHED
6h ago
2026-04-30
RELEVANCE
9/10
AUTHOR
Real_Ebb_7417