OPEN_SOURCE
REDDIT · 38d ago · NEWS
RLVR beats SFT on Qwen2.5-1.5B reasoning
An independent project trained Qwen2.5-1.5B-Instruct on GSM8K with both GRPO-based RLVR and SFT, finding that RLVR improved GSM8K accuracy by 11.9 points while SFT reduced it by 15.2 points. Across 388 checkpoints, RLVR also improved MATH scores, including in one-example setups, while SFT mainly improved output formatting rather than answer accuracy.
// ANALYSIS
This is a sharp reminder that objective-aligned RL can outperform naive fine-tuning on reasoning tasks, even at small model scale.
- RLVR gains on both GSM8K and MATH suggest generalization beyond a single benchmark split.
- SFT underperformance supports the claim that format imitation can overwrite useful pretrained reasoning behavior.
- The test-set and one-example experiments surface useful signals about contamination risk and data efficiency.
- Open release of code, checkpoints, and a queryable results database makes the findings unusually reproducible.
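The verifiable-reward setup behind GRPO-based RLVR can be sketched in a few lines. This is an illustrative assumption, not the project's actual code: the reward function, the `####` answer-extraction rule (borrowed from the GSM8K answer format), and the group size are all hypothetical, but the group-relative advantage computation is the core GRPO idea.

```python
# Hedged sketch of a GRPO-style verifiable-reward signal (not the project's
# actual implementation). Each prompt gets a group of sampled completions;
# an automatic checker assigns a binary reward, and each completion's
# advantage is its reward normalized within the group.

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the reference."""
    # Hypothetical extraction rule: take the text after the last '####',
    # mirroring the GSM8K answer format.
    answer = completion.split("####")[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO advantage: reward minus group mean, scaled by group std-dev."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled solutions to one problem, reference answer "42".
completions = [
    "... #### 42",  # correct
    "... #### 41",  # wrong
    "... #### 42",  # correct
    "... #### 7",   # wrong
]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by a checker rather than imitated from labels, correct reasoning traces are reinforced relative to incorrect ones in the same group, which is one plausible explanation for why RLVR improved answer accuracy where SFT mostly changed formatting.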
// TAGS
rlvr-vs-sft-qwen2.5-1.5b · qwen2.5 · llm · fine-tuning · reasoning · research
DISCOVERED
38d ago
2026-03-05
PUBLISHED
39d ago
2026-03-03
RELEVANCE
8 / 10
AUTHOR
jayminban