OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
Bench 2 crowns Nemotron 3 Nano
Bench 2’s latest run compares five 3-4B models across finance, reasoning, and simple code on an 18GB M3 Pro. Nemotron 3 Nano wins overall and dominates finance, while Qwen 3.5 4B falls apart under the fixed 1024-token budget.
// ANALYSIS
This is a useful benchmark, but the fixed token cap is clearly part of the story: it measures model quality plus how efficiently each model thinks within budget. The result still says something real about size-class specialization, especially at 3-4B.
- Nemotron 3 Nano is the standout because it stays within budget and hits 100% on finance, which is exactly where many small models usually wobble
- Phi-4 Mini looks like the best balanced generalist, with strong finance and code plus a much less lopsided profile than the others
- Granite4:3B and Nemotron 3 Nano split into coder vs reasoner personalities, which is a strong argument for task-specific model selection at this size
- Qwen 3.5 4B’s low score looks more like a truncation failure than pure capability loss, so per-model budgets are the right next experiment
- The benchmark is most interesting as a methodology warning: fixed budgets can distort comparisons between thinking and non-thinking models
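The per-model-budget experiment suggested above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual harness: the model names mirror those in the post, but the budget numbers and the `finish_reason` convention (`"length"` meaning the cap was hit, as in common LLM APIs) are assumptions.

```python
# Sketch: per-model token budgets instead of one fixed cap, plus a
# truncation check so "ran out of budget" is scored separately from
# "got the answer wrong". All numbers here are illustrative guesses.

FIXED_BUDGET = 1024  # the cap the original run applied to every model

# Thinking-heavy models get more room; these figures are placeholders.
PER_MODEL_BUDGET = {
    "nemotron-3-nano": 1024,  # already finishes within the fixed cap
    "phi-4-mini": 1024,
    "qwen-3.5-4b": 2048,      # long reasoning traces: raise the cap
}

def budget_for(model: str) -> int:
    """Token cap for a model, falling back to the fixed default."""
    return PER_MODEL_BUDGET.get(model, FIXED_BUDGET)

def is_truncated(model: str, tokens_used: int, finish_reason: str) -> bool:
    """Flag runs that hit the cap without a natural stop.

    A truncated run is a budget failure, not a capability failure,
    and should be reported as its own category in the results.
    """
    return finish_reason == "length" or tokens_used >= budget_for(model)
```

Separating truncation from wrong answers would directly test the analysis above: if Qwen 3.5 4B's score recovers under a larger cap, the fixed budget, not the model, was the bottleneck.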
// TAGS
bench-2 · benchmark · llm · reasoning · ai-coding · finance
DISCOVERED
4h ago (2026-04-27)
PUBLISHED
5h ago (2026-04-27)
RELEVANCE
10/10
AUTHOR
FederalAnalysis420