REDDIT · 25d ago · BENCHMARK RESULT

PhD-Zero lifts Qwen3-1.7B to 20% AIME25

A LocalLLaMA post reports that a PhD student used PhD-Zero, an autonomous R&D agent workflow, to tune Qwen3-1.7B Base from 0.0% to 20.0% on AIME25 in 48 hours, across 11 mostly hands-off iterations. The author attributes the jump to compressing the model's reasoning traces and to an agent-detected training bug (a loss_mask mismatch) whose fix unlocked learning.

// ANALYSIS

Interesting signal for agentic model optimization, but it reads as an early proof-of-concept rather than a settled breakthrough.

  • The result suggests small models may benefit from shorter, cleaner reasoning traces instead of longer CoT.
  • The autonomous debugging claim is notable: finding and fixing a `qwen` vs `qwen3` masking issue is exactly the kind of tedious bottleneck agents can remove.
  • Even with the gain, 20.0% is still well below the cited reproduced Qwen-Thinking baseline (33.3%), so there is headroom before parity.
  • Evidence is currently a Reddit discussion plus project repo, so independent reproduction will matter before treating this as a robust benchmark shift.
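The masking bug in the second bullet is a classic silent failure mode: if the loss mask is built against the wrong chat template's assistant markers, the assistant span is never found, every token gets masked out, and the loss gradient is zero, so the model simply does not learn. A minimal sketch of that failure (the marker strings and `build_loss_mask` helper are illustrative, not the real Qwen templates or PhD-Zero code):

```python
# Illustrative per-template markers delimiting the assistant turn.
# NOTE: hypothetical strings, not the actual Qwen chat templates.
ASSISTANT_MARKERS = {
    "qwen":  ("<|im_start|>assistant\n", "<|im_end|>"),
    "qwen3": ("<|im_start|>assistant\n<think>", "<|im_end|>"),
}

def build_loss_mask(rendered: str, template: str) -> list[int]:
    """Return 1 for characters inside the assistant span, 0 elsewhere."""
    start_tok, end_tok = ASSISTANT_MARKERS[template]
    mask = [0] * len(rendered)
    start = rendered.find(start_tok)
    if start == -1:
        # Marker not found: the whole sequence is masked out,
        # the SFT loss is zero, and training silently does nothing.
        return mask
    begin = start + len(start_tok)
    end = rendered.find(end_tok, begin)
    end = end if end != -1 else len(rendered)
    for i in range(begin, end):
        mask[i] = 1
    return mask

# Conversation rendered with the plain "qwen"-style template:
text = "<|im_start|>user\nQ<|im_end|>\n<|im_start|>assistant\nA<|im_end|>"

good = build_loss_mask(text, "qwen")   # non-zero: trains on the answer span
bad = build_loss_mask(text, "qwen3")   # all zeros: the mismatch bug
```

Spotting that `sum(mask) == 0` across a whole batch is exactly the kind of mechanical check an R&D agent can run on every iteration, which is plausibly how PhD-Zero caught it.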
// TAGS
phd-zero · qwen3-1.7b · benchmark · fine-tuning · agent · reasoning · open-source

DISCOVERED

2026-03-17 (25d ago)

PUBLISHED

2026-03-17 (25d ago)

RELEVANCE

8/10

AUTHOR

Rare-Salt2588