Alibaba tops AIME25 with agentic data synthesis
OPEN_SOURCE ↗
YT · YOUTUBE · 27d ago · RESEARCH PAPER


Alibaba and SJTU's Agentic Proposing framework trains a 4B proposer model to synthesize high-difficulty reasoning problems by composing modular skills as a sequential agentic decision process. A 30B downstream solver trained on just 11,000 agent-synthesized trajectories hits 91.6% on AIME 2025 — rivaling frontier proprietary models.

// ANALYSIS

The real bottleneck for reasoning model training has always been data quality, not model scale, and this paper makes the case more forcefully than most by doing it with a surprisingly small budget of 11K trajectories.

  • The core insight: treat problem synthesis as a multi-step agentic task (Draft → Check → Refine → Finalize) with a library of composable atomic skills, so the proposer can build genuinely hard, verifiable problems rather than trivially easy or unsolvable ones
  • Multi-Granularity Policy Optimization (MGPO) is the RL secret sauce — it combines trajectory-level and stage-level advantage estimates to handle sparse rewards during proposer training, outperforming standard GRPO by 6.5 points
  • Cross-domain generalization is notable: gains aren't just on math (AIME, HMMT) but extend to coding (LiveCodeBench +5pts), science (OlympicArena +4.4%), and general reasoning (GPQA +6.3%)
  • The 91.6% AIME25 figure needs scrutiny — no independent replication yet, and the GitHub repo is an empty placeholder; the results hinge on their proprietary verifier ensemble (gpt-oss-120b, DeepSeek-V3.2-Special, Qwen3-235B-Thinking)
  • If the code/weights release materializes, this could meaningfully shift how labs approach cold-start synthetic data pipelines for reasoning models
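The Draft → Check → Refine → Finalize loop described above can be sketched as a simple agentic control flow. Everything here is illustrative: the skill library, the toy verifier, and all function names are stand-ins invented for this sketch, not the paper's actual components (which are not yet public).

```python
from dataclasses import dataclass, field

# Toy "atomic skill" library: each skill transforms an integer seed, standing in
# for one composable reasoning step. Names and operations are purely illustrative.
SKILLS = {
    "modular": lambda x: x % 7,
    "square": lambda x: x * x,
    "shift": lambda x: x + 3,
}

@dataclass
class Trajectory:
    skills: list = field(default_factory=list)  # skills composed so far
    answer: int = 0
    accepted: bool = False

def draft(seed, skill_names):
    """Draft: compose skills sequentially into a candidate problem/answer."""
    value = seed
    for name in skill_names:
        value = SKILLS[name](value)
    return Trajectory(skills=list(skill_names), answer=value)

def check(traj, min_skills=2):
    """Check: stand-in verifier -- demand enough composed skills that the
    problem is non-trivial (the paper uses model-based verification instead)."""
    return len(traj.skills) >= min_skills

def refine(traj, seed):
    """Refine: on a failed check, add one unused skill and re-draft."""
    unused = [n for n in SKILLS if n not in traj.skills]
    return draft(seed, traj.skills + unused[:1])

def finalize(seed, skill_names, max_refines=3):
    """Finalize: iterate Draft -> Check -> Refine until the verifier accepts."""
    traj = draft(seed, skill_names)
    for _ in range(max_refines):
        if check(traj):
            traj.accepted = True
            return traj
        traj = refine(traj, seed)
    return traj

traj = finalize(seed=5, skill_names=["square"])
print(traj.accepted, traj.skills)  # → True ['square', 'modular']
```

The point of the sequential-decision framing is that each stage is a checkpoint where the proposer can course-correct, rather than emitting a whole problem in one shot and hoping it lands in the narrow band between trivial and unsolvable.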
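MGPO's headline idea, blending trajectory-level and stage-level advantage estimates, can be sketched under one big assumption: that both granularities use GRPO-style group normalization and are mixed linearly. The paper's exact combination rule is not public, so the `alpha` blend below is a guess for illustration, not the actual algorithm.

```python
import statistics

def group_normalize(rewards):
    """GRPO-style advantage: z-score rewards within a sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def mgpo_advantages(traj_rewards, stage_rewards, alpha=0.5):
    """Blend two advantage granularities per (trajectory, stage) pair.

    traj_rewards:  one scalar reward per sampled trajectory (sparse, end-of-episode).
    stage_rewards: per-trajectory list of per-stage rewards (e.g. one per
                   Draft/Check/Refine/Finalize step).
    alpha:         illustrative mixing weight; the paper's rule is unknown.
    """
    traj_adv = group_normalize(traj_rewards)
    n_stages = len(stage_rewards[0])
    # Normalize each stage position across the group, so a stage is credited
    # relative to how the *same* stage went in sibling trajectories.
    stage_adv = [group_normalize([sr[s] for sr in stage_rewards])
                 for s in range(n_stages)]
    return [[alpha * traj_adv[i] + (1 - alpha) * stage_adv[s][i]
             for s in range(n_stages)]
            for i in range(len(traj_rewards))]

# Two sampled trajectories, two stages each: the first succeeds throughout.
adv = mgpo_advantages([1.0, 0.0], [[1.0, 1.0], [0.0, 0.0]])
print(adv)  # → [[1.0, 1.0], [-1.0, -1.0]]
```

The intuition for why this helps with sparse rewards: when the trajectory-level signal is a single pass/fail bit, stage-level normalization still differentiates *which step* of the proposal process went better than its siblings, giving the policy gradient something to hold onto mid-trajectory.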
// TAGS
agentic-proposing · llm · reasoning · fine-tuning · benchmark · agent · research

DISCOVERED

2026-03-15 (27d ago)

PUBLISHED

2026-03-15 (27d ago)

RELEVANCE

8 / 10

AUTHOR

Discover AI