OPEN_SOURCE
REDDIT · 2h ago · BENCHMARK RESULT
Devstral Small 2 tops local codebench
A Reddit benchmark post says Devstral Small 2, especially Unsloth’s Q8 GGUF, is the first local model to clear 80% on the author’s Scaffold Bench and outperform their usual Qwen-based coding stack. The result looks strong for real codebase work, but the author also notes lower throughput in tokens per second (TPS) and treats the benchmark as a work in progress.
// ANALYSIS
Interesting signal, but still a home-built benchmark, so the score matters more as directional evidence than as a universal ranking. The bigger takeaway is that Mistral may have finally shipped a local coding model that feels competitive for agentic code work instead of merely “good for local.”
- Mistral’s own docs position Devstral Small 2 as an open-source model for tool use, multi-file edits, and codebase exploration, with a 256k context window and local deployment support.
- The official launch claims 68.0% on SWE-bench Verified, so the Redditor’s stronger private benchmark result is plausible, but likely sensitive to task mix, prompts, and quantization.
- The Q8 Unsloth GGUF is probably doing real work here; local coding performance often swings on quant quality, tool-calling behavior, and template correctness as much as raw parameter count.
- Slow TPS is the expected tradeoff for a 24B local model, but if wall time stays acceptable and the fixes are high quality, it becomes a serious offline/private coding option.
- Scaffold-style benchmarks can reward the exact behaviors they measure and miss regression risk, code style, and maintainability, so production testing still matters.
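One way to reason about the TPS-versus-quality tradeoff noted above is to estimate generation time per solved task rather than per attempt. The sketch below is illustrative only: the function name, the token budget, and every number in the example are hypothetical assumptions, not figures from the Reddit post or from Mistral.

```python
def wall_time_per_solved_task(pass_rate, tokens_per_task, tokens_per_second):
    """Expected generation seconds spent per successfully solved task.

    Assumes each attempt costs tokens_per_task / tokens_per_second seconds
    and succeeds independently with probability pass_rate, so the expected
    number of attempts per success is 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    seconds_per_attempt = tokens_per_task / tokens_per_second
    return seconds_per_attempt / pass_rate


if __name__ == "__main__":
    # Hypothetical inputs: a slower, more accurate model vs. a faster,
    # less accurate one, both generating 4000 tokens per attempt.
    slow_accurate = wall_time_per_solved_task(0.8, 4000, 20)
    fast_sloppy = wall_time_per_solved_task(0.5, 4000, 40)
    print(f"slow but accurate: {slow_accurate:.0f} s per solved task")
    print(f"fast but sloppy:   {fast_sloppy:.0f} s per solved task")
```

Under this simple model, a slower model only wins on wall time when its pass-rate advantage outweighs its throughput deficit. The model ignores prompt processing, tool-call latency, and the cost of reviewing failed fixes.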
// TAGS
devstral-small-2 · llm · ai-coding · benchmark · open-source · self-hosted
DISCOVERED
2h ago
2026-04-30
PUBLISHED
2h ago
2026-04-30
RELEVANCE
9/10
AUTHOR
alphatrad