REDDIT // 6h ago // BENCHMARK RESULT

Mistral Medium 3.5 Posts Strong TBLite Score

A Reddit user benchmarked Mistral Medium 3.5 on TBLite, a lighter proxy for TerminalBench 2.0, and says it performs well for its size. The result is unofficial and single-run, but it adds to the case that Mistral’s new 128B model is materially better at agentic coding than prior Mistral releases.

// ANALYSIS

This is a directional signal, not a verdict, but it is the kind of signal that matters to people choosing models for terminal agents and tool-heavy workflows. The bigger story is that Mistral seems to have closed much of the gap on practical agentic behavior, not just on chat and static coding benchmarks.

– TBLite is not TerminalBench 2.0, so the number should be treated as a trend indicator rather than a direct substitute
– A single run can swing meaningfully, especially on agent benchmarks with tool-use variance
– The result matters because Medium 3.5 is open-weight and self-hostable, so strong agentic performance has deployment value beyond API users
– It fits Mistral’s own positioning: merged instruction, reasoning, and coding in one 128B dense model
– Compared with the post’s reference point, it suggests a clear step up from earlier Mistral models in terminal/tool reliability
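
To put the single-run caveat in numbers, here is a rough sketch of how wide the uncertainty on one pass rate can be. It uses a standard Wilson score interval and assumes tasks are independent pass/fail trials, which agent benchmarks only approximate. The task count and pass count are hypothetical placeholders, not figures from the Reddit post.

```python
import math

def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate from one run.

    Assumes each task is an independent pass/fail trial; it does not model
    extra run-to-run variance from tool use or sampling temperature.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical numbers for illustration only: 50 passes out of 100 tasks.
low, high = wilson_interval(50, 100)
print(f"observed 50.0%, plausible range {low:.1%} to {high:.1%}")
```

With roughly 100 tasks, the interval spans close to twenty percentage points, which is why a single unofficial run is better read as a trend than as a ranking.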

// TAGS

mistral-medium-3.5, benchmark, agent, ai-coding, llm, open-weights

DISCOVERED

6h ago

2026-04-30

PUBLISHED

6h ago

2026-04-30

RELEVANCE

9/10

AUTHOR

Real_Ebb_7417