OPEN_SOURCE
REDDIT · 25d ago // BENCHMARK RESULT
PhD-Zero lifts Qwen3-1.7B to 20% AIME25
A LocalLLaMA post reports that a PhD student used PhD-Zero, an autonomous R&D agent workflow, to tune Qwen3-1.7B Base from 0.0% to 20.0% on AIME25 in 48 hours across 11 mostly hands-off iterations. The author attributes the jump to thinking compression and an agent-detected training bug fix (loss_mask mismatch) that unlocked learning.
// ANALYSIS
Interesting signal for agentic model optimization, but it reads as an early proof-of-concept rather than a settled breakthrough.
- The result suggests small models may benefit from shorter, cleaner reasoning traces instead of longer CoT.
- The autonomous debugging claim is notable: finding and fixing a `qwen` vs `qwen3` masking issue is exactly the kind of tedious bottleneck agents can remove.
- Even with the gain, 20.0% is still well below the cited reproduced Qwen-Thinking baseline (33.3%), so there is headroom before parity.
- Evidence is currently a Reddit discussion plus the project repo, so independent reproduction will matter before treating this as a robust benchmark shift.
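To make the debugging bullet concrete, here is a minimal, hypothetical sketch of why a chat-template loss-mask mismatch can silently block learning. This is not PhD-Zero's actual code; the marker strings and helper name are assumptions for illustration. The idea: supervised fine-tuning masks loss to assistant-response tokens only, so if the mask logic looks for the wrong template markers (e.g. a `qwen`-style marker when the data uses `qwen3` formatting), the mask can end up all zeros and no gradient signal ever reaches the model.

```python
# Illustrative only: a toy loss mask over tokenized chat-template text.
# Tokens before the assistant-role marker get mask 0 (ignored by the
# loss); tokens after it get mask 1 (trained on).

def build_loss_mask(token_strs, response_role_marker):
    """Return a 0/1 mask: 1 for tokens after the role marker, else 0."""
    mask = []
    in_response = False
    for tok in token_strs:
        mask.append(1 if in_response else 0)
        if tok == response_role_marker:
            in_response = True
    return mask

# Hypothetical tokenization of one chat turn (marker names assumed).
tokens = ["<|im_start|>", "user", "Q", "<|im_end|>",
          "<|im_start|>", "assistant", "A1", "A2", "<|im_end|>"]

# Correct marker: the assistant's response tokens contribute to the loss.
good = build_loss_mask(tokens, "assistant")   # → [0,0,0,0,0,0,1,1,1]

# Wrong marker (template mismatch): mask is all zeros, so the training
# loss is identically zero and the model never learns anything.
bad = build_loss_mask(tokens, "Assistant:")   # → all zeros
```

In a real trainer the all-zero mask is easy to miss because training still "runs" without errors, which is why an agent that audits the mask statistics can catch this class of bug quickly.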
// TAGS
phd-zero · qwen3-1.7b · benchmark · fine-tuning · agent · reasoning · open-source
DISCOVERED
2026-03-17 (25d ago)
PUBLISHED
2026-03-17 (25d ago)
RELEVANCE
8/10
AUTHOR
Rare-Salt2588