RLVR beats SFT on Qwen2.5-1.5B reasoning
An independent project trained Qwen2.5-1.5B-Instruct on GSM8K with both GRPO-based RLVR and SFT, finding that RLVR raised the GSM8K score by 11.9 while SFT lowered it by 15.2. Across 388 checkpoints, RLVR also improved MATH scores, including in one-example setups, while SFT mainly improved output formatting rather than answer accuracy.
This is a sharp reminder that objective-aligned RL can outperform naive fine-tuning on reasoning tasks, even at small model scale.
- RLVR gains on both GSM8K and MATH suggest generalization beyond a single benchmark split.
- SFT underperformance supports the claim that format imitation can overwrite useful pretrained reasoning behavior.
- The test-set and one-example experiments surface useful signals about contamination risk and data efficiency.
- Open release of code, checkpoints, and a queryable results database makes the findings unusually reproducible.
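For readers unfamiliar with the setup, the sketch below illustrates the general idea behind GRPO-style RLVR on a GSM8K-like problem: each sampled completion is scored by a programmatic check against the known answer, and rewards are normalized within the group of samples for the same prompt. This is not the project's code. The function names, the "last number in the text" extraction rule, and the binary 0/1 reward are assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def extract_final_number(text):
    """Return the last number appearing in text as a float, or None.

    Assumption: the model's final answer is the last number it prints.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def verifiable_reward(completion, gold_answer):
    """Binary reward: 1.0 if the extracted answer matches the gold answer."""
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold_answer) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical completions sampled for one prompt whose gold answer is 18.
completions = [
    "She sells 9 eggs at $2 each, so she makes 18",
    "The total is 20",
    "Answer: 18.0",
    "I am not sure.",
]
rewards = [verifiable_reward(c, 18.0) for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the reward depends only on whether the final answer is correct, it gives no credit for well-formatted but wrong output, which is one plausible reading of why SFT's formatting gains did not carry over to accuracy here.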
DISCOVERED
2026-03-05
PUBLISHED
2026-03-03
AUTHOR
jayminban