REDDIT // 2h ago · BENCHMARK RESULT

Devstral Small 2 tops local codebench

A Reddit benchmark post says Devstral Small 2, especially Unsloth’s Q8 GGUF, is the first local model to clear 80% on the author’s Scaffold Bench and to outperform their usual Qwen-based coding stack. The result looks strong for real codebase work, but the author also notes lower throughput in tokens per second (TPS) and treats the benchmark as a work in progress.

// ANALYSIS

Interesting signal, but Scaffold Bench is still a home-built benchmark, so the score is better read as directional evidence than as a universal ranking. The bigger takeaway is that Mistral may have finally shipped a local coding model that feels competitive for agentic code work instead of merely “good for local.”

– Mistral’s own docs position Devstral Small 2 as an open-source model for tool use, multi-file edits, and codebase exploration, with a 256k context window and local deployment support.
– The official launch claims 68.0% on SWE-bench Verified, so a strong result on the Redditor’s private benchmark is plausible. The two scores are not directly comparable, though, and the private result is likely sensitive to task mix, prompts, and quantization.
– The Q8 Unsloth GGUF probably contributes materially to the result; local coding performance often swings on quant quality, tool-calling behavior, and chat-template correctness as much as on raw parameter count.
– Slow TPS is the expected tradeoff for a 24B local model, but if wall time per task stays acceptable and the fixes are high quality, it becomes a serious offline/private coding option.
– Scaffold-style benchmarks can reward the exact behaviors they measure and miss regression risk, code style, and maintainability, so production testing still matters.
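
The TPS-versus-quality tradeoff in the bullets above can be made concrete with a rough back-of-the-envelope sketch. It assumes a failed attempt is simply retried and that attempts are independent, so the expected number of attempts is 1 / pass rate. The function name and every number below are illustrative assumptions, not figures from the Reddit post or from Mistral.

```python
def expected_seconds_per_solved_task(tokens_per_task, tokens_per_second, pass_rate):
    """Rough expected wall time to get one passing result.

    Assumes independent retries, so expected attempts = 1 / pass_rate.
    All inputs are hypothetical; measure your own setup before comparing models.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    seconds_per_attempt = tokens_per_task / tokens_per_second
    return seconds_per_attempt / pass_rate


# Made-up comparison: a slower model with a higher pass rate
# versus a faster model with a lower pass rate.
slow_accurate = expected_seconds_per_solved_task(4000, 20, 0.8)
fast_sloppy = expected_seconds_per_solved_task(4000, 40, 0.35)
print(f"slow but accurate: {slow_accurate:.0f} s per solved task")
print(f"fast but sloppy:   {fast_sloppy:.0f} s per solved task")
```

Under these invented inputs the slower model still wins on time per solved task, which is the sense in which slow TPS can be acceptable. With a different pass-rate gap or throughput gap the comparison can flip.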

// TAGS

devstral-small-2 · llm · ai-coding · benchmark · open-source · self-hosted

DISCOVERED

2h ago

2026-04-30

PUBLISHED

2h ago

2026-04-30

RELEVANCE

9/10

AUTHOR

alphatrad