OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
Bench 2 crowns Nemotron 3 Nano
Bench 2’s latest run compares five 3-4B models across finance, reasoning, and simple code on an 18GB M3 Pro. Nemotron 3 Nano wins overall and dominates finance, while Qwen 3.5 4B falls apart under the fixed 1024-token budget.
// ANALYSIS
This is a useful benchmark, but the fixed token cap is clearly part of the story: it measures model quality plus how efficiently each model thinks within budget. The result still says something real about size-class specialization, especially at 3-4B.
- Nemotron 3 Nano is the standout because it stays within budget and hits 100% on finance, which is exactly where many small models usually wobble
- Phi-4 Mini looks like the best balanced generalist, with strong finance and code plus a much less lopsided profile than the others
- Granite4:3B and Nemotron 3 Nano split into coder vs reasoner personalities, which is a strong argument for task-specific model selection at this size
- Qwen 3.5 4B’s low score looks more like a truncation failure than pure capability loss, so per-model budgets are the right next experiment
- The benchmark is most interesting as a methodology warning: fixed budgets can distort comparisons between thinking and non-thinking models
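The per-model-budget experiment suggested above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual harness: the model names mirror those in the post, but the budget numbers and the `finish_reason` convention (`"length"` meaning the cap was hit, as in common LLM APIs) are assumptions.

```python
# Sketch: per-model token budgets instead of one fixed cap, plus a
# truncation check so "ran out of budget" is scored separately from
# "got the answer wrong". All numbers here are illustrative guesses.

FIXED_BUDGET = 1024  # the cap the original run applied to every model

# Thinking-heavy models get more room; these figures are placeholders.
PER_MODEL_BUDGET = {
    "nemotron-3-nano": 1024,  # already finishes within the fixed cap
    "phi-4-mini": 1024,
    "qwen-3.5-4b": 2048,      # long reasoning traces: raise the cap
}

def budget_for(model: str) -> int:
    """Token cap for a model, falling back to the fixed default."""
    return PER_MODEL_BUDGET.get(model, FIXED_BUDGET)

def is_truncated(model: str, tokens_used: int, finish_reason: str) -> bool:
    """Flag runs that hit the cap without a natural stop.

    A truncated run is a budget failure, not a capability failure,
    and should be reported as its own category in the results.
    """
    return finish_reason == "length" or tokens_used >= budget_for(model)
```

Separating truncation from wrong answers would directly test the analysis above: if Qwen 3.5 4B's score recovers under a larger cap, the fixed budget, not the model, was the bottleneck.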
// TAGS
bench-2 · benchmark · llm · reasoning · ai-coding · finance
DISCOVERED
4h ago (2026-04-27)
PUBLISHED
5h ago (2026-04-27)
RELEVANCE
10/10
AUTHOR
FederalAnalysis420