RLVR beats SFT on Qwen2.5-1.5B reasoning

// 84d ago · NEWS

An independent project trained Qwen2.5-1.5B-Instruct on GSM8K with two methods: GRPO-based RLVR (reinforcement learning with verifiable rewards) and SFT (supervised fine-tuning). RLVR changed the GSM8K score by +11.9, while SFT changed it by -15.2. Across 388 checkpoints, RLVR also improved MATH scores, including in one-example setups, while SFT mainly improved output formatting rather than answer accuracy.
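To make the RLVR setup concrete, here is a minimal sketch of the two ingredients the summary names: a verifiable reward (does the final number match the reference answer?) and the group-relative advantage that GRPO computes over several sampled completions for one prompt. This is not the project's code. The function names, the "last number in the text" extraction rule, and the binary 0/1 reward are assumptions chosen for illustration.

```python
import re
from statistics import mean, pstdev


def extract_final_answer(text):
    """Return the last number found in the text as a string, or None.

    Assumption: the final answer is the last number the model writes.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions sampled for one prompt whose answer is 18.
group = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18. The answer is 18.",
    "I get 20",
    "The total is 18",
    "no idea",
]
rewards = [verifiable_reward(c, "18") for c in group]
advantages = group_relative_advantages(rewards)
```

Correct completions receive positive advantages and incorrect ones negative. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.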

// ANALYSIS

This is a sharp reminder that objective-aligned RL can outperform naive fine-tuning on reasoning tasks, even at small model scale.

- RLVR gains on both GSM8K and MATH suggest generalization beyond a single benchmark split.
- SFT underperformance supports the claim that format imitation can overwrite useful pretrained reasoning behavior.
- The test-set and one-example experiments surface useful signals about contamination risk and data efficiency.
- Open release of code, checkpoints, and a queryable results database makes the findings unusually reproducible.
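The distinction between formatting and answer accuracy can be made measurable by scoring the two separately. The sketch below is an illustration under stated assumptions, not the project's evaluation harness: it treats a completion as well-formatted if it ends with a GSM8K-style `#### <number>` line, and as accurate if the last number in the text equals the reference. The helper names are hypothetical.

```python
import re

# Assumed format rule: the completion ends with "#### <number>".
FORMAT_RE = re.compile(r"####\s*-?[\d,]+(?:\.\d+)?\s*$")
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def is_well_formatted(completion):
    """True if the completion ends with a '#### <number>' line."""
    return FORMAT_RE.search(completion.strip()) is not None


def is_correct(completion, reference):
    """True if the last number in the completion equals the reference."""
    numbers = NUMBER_RE.findall(completion)
    if not numbers:
        return False
    return float(numbers[-1].replace(",", "")) == float(reference)


def evaluate(completions, references):
    """Report format compliance and answer accuracy as separate rates."""
    n = len(completions)
    format_hits = sum(is_well_formatted(c) for c in completions)
    correct_hits = sum(
        is_correct(c, r) for c, r in zip(completions, references)
    )
    return {"format_rate": format_hits / n, "accuracy": correct_hits / n}


# Hypothetical outputs: perfectly formatted but wrong, versus the reverse.
formatted_but_wrong = evaluate(["#### 20", "Work... #### 7"], ["18", "18"])
unformatted_but_right = evaluate(["The answer is 18", "so 18 total"], ["18", "18"])
```

Reporting the two rates side by side shows whether a checkpoint learned the output template, the arithmetic, or both.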

// TAGS

rlvr-vs-sft-qwen2.5-1.5b, qwen2.5, llm, fine-tuning, reasoning, research

DISCOVERED

84d ago

2026-03-05

PUBLISHED

85d ago

2026-03-03

RELEVANCE

8/10

AUTHOR

jayminban
