RLVR beats SFT on Qwen2.5-1.5B reasoning

// 84d ago NEWS

An independent project trained Qwen2.5-1.5B-Instruct on GSM8K with both GRPO-based RLVR (reinforcement learning with verifiable rewards) and SFT (supervised fine-tuning). RLVR raised the GSM8K score by 11.9 points, while SFT lowered it by 15.2 points. Across 388 checkpoints, RLVR also improved MATH scores, including in one-example setups, while SFT mainly improved output formatting rather than answer accuracy.
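
To make the GRPO-based RLVR setup concrete, here is a minimal sketch of its two core pieces: a verifiable reward that checks a completion's final number against the reference answer, and the group-relative advantage that GRPO computes over several completions sampled for the same prompt. The function names, the "last number in the text" extraction rule, and the use of population standard deviation are assumptions for illustration; the project's actual reward and training code may differ.

```python
import re
from statistics import mean, pstdev

# Hypothetical extraction rule: treat the last number in the text as the answer.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def verifiable_reward(completion, gold):
    """Binary reward: 1.0 if the extracted answer equals `gold`, else 0.0."""
    pred = extract_final_number(completion)
    if pred is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is only rewarded relative to its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gold = 18.0
    group = [
        "16 - 3 - 4 = 9 eggs, 9 * 2 = 18. The answer is 18",
        "She sells 16 eggs for 32 dollars.",
        "I am not sure.",
        "9 eggs at $2 each gives 18",
    ]
    rewards = [verifiable_reward(c, gold) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is why the reward only needs to be checkable, not dense.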

// ANALYSIS

This is a sharp reminder that objective-aligned RL can outperform naive fine-tuning on reasoning tasks, even at small model scale.

  • RLVR gains on both GSM8K (the training benchmark) and MATH suggest generalization beyond the benchmark the model was trained on.
  • SFT underperformance supports the claim that format imitation can overwrite useful pretrained reasoning behavior.
  • The test-set and one-example experiments surface useful signals about contamination risk and data efficiency.
  • Open release of code, checkpoints, and a queryable results database makes the findings unusually reproducible.
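
The distinction between formatting and answer accuracy drawn above can be measured by scoring the two separately. The sketch below assumes outputs are expected to end with a GSM8K-style `#### <number>` line and counts an answer as correct if the last number in the output matches the reference, whether or not the format is followed. The function name and both scoring rules are hypothetical, not the project's evaluation harness.

```python
import re

# Assumed target format: the output ends with a "#### <number>" line.
ANSWER_LINE = re.compile(r"####\s*-?\d[\d,]*(?:\.\d+)?\s*$")
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def score_outputs(outputs, golds):
    """Return format compliance and answer accuracy as separate rates."""
    formatted = 0
    correct = 0
    for out, gold in zip(outputs, golds):
        # Format check: does the output end with the expected answer line?
        if ANSWER_LINE.search(out.strip()):
            formatted += 1
        # Accuracy check: is the last number in the output the right one?
        numbers = NUMBER.findall(out)
        if numbers and abs(float(numbers[-1].replace(",", "")) - gold) < 1e-6:
            correct += 1
    n = len(outputs)
    return {"format_rate": formatted / n, "accuracy": correct / n}


if __name__ == "__main__":
    outputs = ["#### 18", "#### 20", "#### 7", "the answer is 18"]
    golds = [18.0, 18.0, 18.0, 18.0]
    print(score_outputs(outputs, golds))
```

Tracking both rates per checkpoint makes it visible when a training method raises format compliance while accuracy stays flat or drops.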
// TAGS
rlvr-vs-sft-qwen2.5-1.5b, qwen2.5, llm, fine-tuning, reasoning, research

DISCOVERED

84d ago

2026-03-05

PUBLISHED

85d ago

2026-03-03

RELEVANCE

8/10

AUTHOR

jayminban