OPEN_SOURCE
HN · HACKER_NEWS // 28d ago // RESEARCH PAPER
MCTS distillation beats GRPO on LLM reasoning tasks
Researcher Ayush Tambde demonstrates that combining Monte Carlo Tree Search with online PPO distillation outperforms GRPO and Best-of-N sampling on combinatorial reasoning tasks, achieving 11.3% vs. 8.4% on the Countdown benchmark with Qwen-2.5-1.5B. The key insight: MCTS generates higher-quality training trajectories by searching over reasoning steps rather than individual tokens, and the distilled model runs at standard inference cost.
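The core idea — run MCTS over candidate reasoning steps (not tokens) at training time, then keep the best path as a distillation trajectory — can be sketched as follows. This is a minimal illustration, not the paper's code: the toy `propose_steps` and `reward` functions, the UCB constant, and the bit-string environment are all assumptions for demonstration.

```python
import math

def ucb(parent, child, c=1.4):
    # Standard UCB1 score; unvisited children are explored first.
    if child["visits"] == 0:
        return float("inf")
    exploit = child["value"] / child["visits"]
    explore = c * math.sqrt(math.log(parent["visits"]) / child["visits"])
    return exploit + explore

def mcts_trajectory(root_state, propose_steps, reward, n_sims=200):
    """Step-level MCTS: each tree node is a whole reasoning step,
    not a single token. Returns the most-visited root-to-leaf path,
    which would serve as a training trajectory for distillation."""
    root = {"state": root_state, "visits": 0, "value": 0.0, "children": []}
    for _ in range(n_sims):
        node, path = root, [root]
        # Selection: descend by UCB until reaching a leaf.
        while node["children"]:
            parent = node
            node = max(node["children"], key=lambda ch: ucb(parent, ch))
            path.append(node)
        # Expansion: one child per candidate next reasoning step.
        for nxt in propose_steps(node["state"]):
            node["children"].append(
                {"state": nxt, "visits": 0, "value": 0.0, "children": []})
        # Evaluation + backpropagation of the leaf's reward.
        r = reward(node["state"])
        for n in path:
            n["visits"] += 1
            n["value"] += r
    # Greedy extraction: the most-visited path is the trajectory to distill.
    traj, node = [root["state"]], root
    while node["children"]:
        node = max(node["children"], key=lambda ch: ch["visits"])
        traj.append(node["state"])
    return traj

# Toy run: states are bit strings, reward favors strings dense in '1's.
traj = mcts_trajectory(
    "",
    lambda s: [s + d for d in "01"],
    lambda s: s.count("1") / max(len(s), 1),
    n_sims=200,
)
```

In the actual method, `propose_steps` would be samples from the policy model and `reward` a verifier on the final answer; the distillation loss then pushes the model toward the extracted trajectories, so no search is needed at inference.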
// ANALYSIS
Using MCTS as a training-time trajectory generator — not a runtime search — is an underexplored middle ground between test-time compute scaling and pure RL, and these results suggest it's worth taking seriously.
- MCTS-distilled model hits 11.3% mean@16 vs. 8.4% for CISPO baseline and 7.7% for Best-of-N (N=64) — a meaningful gap on a small 1.5B model
- Searching over reasoning steps rather than tokens is architecturally cleaner and more compute-efficient than token-level tree search
- The distilled model incurs no extra inference cost — MCTS overhead is training-only, addressing the main practical objection to tree search methods
- Parallel MCTS with virtual losses adds diversity to training trajectories, which may explain why GRPO (with its simpler sampling) leaves performance on the table
- Key open question: whether gains persist at scale — the author explicitly flags this as a first experiment and notes the results may be small-model phenomena
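The two evaluation styles compared above measure different things, which matters when reading the numbers. A minimal sketch, assuming the standard definitions (the source does not spell them out): mean@k averages success over k independent samples, while Best-of-N credits a problem as solved if any of N samples succeeds.

```python
def mean_at_k(successes):
    """mean@k: average success rate over k independent samples (0/1 each)."""
    return sum(successes) / len(successes)

def best_of_n(successes):
    """Best-of-N: solved if at least one of N samples succeeds."""
    return 1.0 if any(successes) else 0.0

# Example: 1 success out of 4 samples on one problem.
samples = [0, 1, 0, 0]
print(mean_at_k(samples))  # 0.25
print(best_of_n(samples))  # 1.0
```

Because Best-of-N only needs one lucky sample, a mean@16 score beating a Best-of-64 score (11.3% vs. 7.7%) is a stronger result than the raw gap suggests.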
// TAGS
llm · reasoning · research · fine-tuning · open-source
DISCOVERED
2026-03-15 (28d ago)
PUBLISHED
2026-03-15 (28d ago)
RELEVANCE
7 / 10
AUTHOR
at2005