MCTS distillation beats GRPO on LLM reasoning tasks
OPEN_SOURCE ↗
HN · HACKER_NEWS · 28d ago · RESEARCH_PAPER


Researcher Ayush Tambde demonstrates that combining Monte Carlo Tree Search with online PPO distillation outperforms GRPO and Best-of-N sampling on combinatorial reasoning tasks, reaching 11.3% vs. 8.4% mean@16 on the Countdown benchmark with Qwen-2.5-1.5B. The key insight: MCTS generates higher-quality training trajectories by searching over reasoning steps rather than individual tokens, and the distilled model runs at standard inference cost.
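The step-level search described above can be sketched in a few dozen lines. This is an illustrative toy, not the paper's implementation: `propose_steps`, `rollout_reward`, and the sum-to-target task are stand-ins for sampling candidate reasoning steps from the policy LLM and scoring complete traces with a verifier.

```python
import math
import random

class Node:
    """One node per reasoning *step* (a whole sub-goal), not per token."""
    def __init__(self, state, parent=None):
        self.state = state          # partial reasoning trace (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # running mean of rollout rewards

    def ucb(self, c=1.4):
        # Unvisited children are explored first; otherwise standard UCT.
        if self.visits == 0:
            return float("inf")
        return self.value + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def propose_steps(state):
    # Stand-in for sampling candidate next reasoning steps from the policy.
    return [state + [a] for a in (0, 1)]

def rollout_reward(state, target=3):
    # Stand-in verifier: reward 1 if the completed trace hits the target.
    return 1.0 if sum(state) == target else 0.0

def mcts(root_state, iters=200, max_depth=4):
    root = Node(list(root_state))
    for _ in range(iters):
        node = root
        # 1. Selection: descend by UCB until a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add candidate next steps if not terminal.
        if len(node.state) < max_depth:
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        # 3. Rollout: complete the trace randomly and score it.
        state = list(node.state)
        while len(state) < max_depth:
            state.append(random.choice((0, 1)))
        reward = rollout_reward(state)
        # 4. Backprop: update mean values along the path to the root.
        while node:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
    return root
```

Under this framing, distillation is just supervised fine-tuning: descend the finished tree by visit count to extract the best trajectory, then train the policy on it with cross-entropy, so inference afterwards is a plain forward pass with no tree search.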

// ANALYSIS

Using MCTS as a training-time trajectory generator — not a runtime search — is an underexplored middle ground between test-time compute scaling and pure RL, and these results suggest it's worth taking seriously.

  • MCTS-distilled model hits 11.3% mean@16 vs. 8.4% for CISPO baseline and 7.7% for Best-of-N (N=64) — a meaningful gap on a small 1.5B model
  • Searching over reasoning steps rather than tokens is architecturally cleaner and more compute-efficient than token-level tree search
  • The distilled model incurs no extra inference cost — MCTS overhead is training-only, addressing the main practical objection to tree search methods
  • Parallel MCTS with virtual losses adds diversity to training trajectories, which may explain why GRPO (with its simpler sampling) leaves performance on the table
  • Key open question: whether gains persist at scale — the author explicitly flags this as a first experiment and notes the results may be small-model phenomena
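The virtual-loss mechanism from the fourth bullet can be illustrated concisely. The idea: when a worker starts descending a branch, it books a provisional zero-reward visit there, so a second in-flight worker sees a lower score and fans out to a different branch. The `Child`/`pick` names and the UCB constants below are assumptions for the sketch, not the paper's code.

```python
import math

class Child:
    def __init__(self, name, prior_value):
        self.name = name
        self.visits = 1
        self.value = prior_value   # mean reward estimate
        self.vloss = 0             # pending virtual losses

    def score(self, parent_visits, c=1.4):
        # A virtual loss counts as an extra visit with zero reward,
        # temporarily lowering the mean for concurrent workers.
        n = self.visits + self.vloss
        v = (self.value * self.visits) / n
        return v + c * math.sqrt(math.log(parent_visits) / n)

def pick(children, parent_visits):
    # Select the best child and book a virtual loss before the
    # rollout result comes back; the loss is cleared on backprop.
    best = max(children, key=lambda ch: ch.score(parent_visits))
    best.vloss += 1
    return best
```

With two children valued 0.9 and 0.8, the first pick takes the stronger one; a second pick issued before the first result returns is steered to the other branch, which is exactly the trajectory diversity the bullet credits for GRPO's gap.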
// TAGS
llm · reasoning · research · fine-tuning · open-source

DISCOVERED

2026-03-15 (28d ago)

PUBLISHED

2026-03-15 (28d ago)

RELEVANCE

7/10

AUTHOR

at2005