OPEN_SOURCE
YT · YOUTUBE // 27d ago // RESEARCH PAPER
Alibaba tops AIME25 with agentic data synthesis
Alibaba and SJTU's Agentic Proposing framework trains a 4B proposer model to synthesize high-difficulty reasoning problems by composing modular skills as a sequential agentic decision process. A 30B downstream solver trained on just 11,000 agent-synthesized trajectories hits 91.6% on AIME 2025 — rivaling frontier proprietary models.
// ANALYSIS
The real bottleneck for reasoning model training has always been data quality, not model scale — and this paper makes the case more forcefully than most with a surprisingly small 11K trajectory count.
- The core insight: treat problem synthesis as a multi-step agentic task (Draft → Check → Refine → Finalize) over a library of composable atomic skills, so the proposer builds genuinely hard, verifiable problems rather than trivially easy or unsolvable ones
- Multi-Granularity Policy Optimization (MGPO) is the RL secret sauce: it combines trajectory-level and stage-level advantage estimates to handle sparse rewards during proposer training, outperforming standard GRPO by 6.5 points
- Cross-domain generalization is notable: gains aren't limited to math (AIME, HMMT) but extend to coding (LiveCodeBench +5 pts), science (OlympicArena +4.4%), and general reasoning (GPQA +6.3%)
- The 91.6% AIME25 figure needs scrutiny: no independent replication yet, the GitHub repo is an empty placeholder, and the results hinge on a proprietary verifier ensemble (gpt-oss-120b, DeepSeek-V3.2-Special, Qwen3-235B-Thinking)
- If the promised code and weights materialize, this could meaningfully shift how labs approach cold-start synthetic-data pipelines for reasoning models
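The Draft → Check → Refine → Finalize loop can be sketched as a small control flow. This is an illustrative reconstruction, not the paper's code: the skill names, the difficulty window, and the random verifier stand-in are all assumptions (the paper uses a trained 4B proposer and an LLM verifier ensemble where this sketch uses random draws).

```python
import random

# Illustrative sketch of the agentic proposing loop.
# All names (SKILL_LIBRARY, difficulty window) are hypothetical.
SKILL_LIBRARY = ["modular_arithmetic", "telescoping_sum", "invariant_argument"]

def draft(skills):
    """Compose sampled atomic skills into a candidate problem."""
    return {"skills": skills, "statement": " + ".join(skills), "difficulty": None}

def check(problem):
    """Stand-in verifier: score difficulty (random proxy for the
    LLM verifier ensemble) and accept hard-but-solvable drafts."""
    problem["difficulty"] = random.random()
    return 0.3 <= problem["difficulty"] <= 0.9

def refine(problem):
    """Swap in a different skill when the draft is too easy or unsolvable."""
    problem["skills"][-1] = random.choice(SKILL_LIBRARY)
    return draft(problem["skills"])

def propose(max_steps=5):
    """Draft -> Check -> Refine loop; Finalize once the check passes."""
    problem = draft(random.sample(SKILL_LIBRARY, k=2))
    for _ in range(max_steps):
        if check(problem):
            return problem          # Finalize
        problem = refine(problem)
    return None                     # no verifiable hard problem found
```

The point of the sequential formulation is that each stage is a separate decision the RL policy can be credited for, which is what makes stage-level advantages (below in the MGPO bullet) well defined.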
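The trajectory-plus-stage advantage mixing behind MGPO can be sketched as follows. This is a hypothetical reading of the idea, not the paper's algorithm: the group-normalized baseline is borrowed from GRPO, and the `alpha` mixing weight and all function names are assumptions.

```python
from statistics import mean, pstdev

def group_advantage(rewards):
    """GRPO-style group-normalized advantage: (r - mean) / std over the group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard sigma == 0
    return [(r - mu) / sigma for r in rewards]

def mgpo_advantages(traj_rewards, stage_rewards, alpha=0.5):
    """Hypothetical MGPO sketch: mix one trajectory-level advantage per
    synthesis episode with per-stage (Draft/Check/Refine) advantages,
    so a sparse terminal reward still yields dense per-stage signal.
    `alpha` weights the two granularities (illustrative, not from the paper)."""
    traj_adv = group_advantage(traj_rewards)
    mixed = []
    for a_traj, stages in zip(traj_adv, stage_rewards):
        stage_adv = group_advantage(stages)
        mixed.append([alpha * a_traj + (1 - alpha) * a_s for a_s in stage_adv])
    return mixed

# Two episodes, two stages each: a succeeding and a failing trajectory.
adv = mgpo_advantages([1.0, 0.0], [[1.0, 1.0], [0.0, 0.0]])
```

With flat stage rewards the stage term contributes nothing and the trajectory term dominates; when stages within an episode differ, the stage term redistributes credit toward the stage that earned it, which is the sparse-reward fix the bullet above describes.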
// TAGS
agentic-proposing · llm · reasoning · fine-tuning · benchmark · agent · research
DISCOVERED
2026-03-15
PUBLISHED
2026-03-15
RELEVANCE
8 / 10
AUTHOR
Discover AI