OPEN_SOURCE
HN · HACKER_NEWS // 28d ago // RESEARCH PAPER
MCTS distillation beats GRPO on LLM reasoning tasks
Researcher Ayush Tambde demonstrates that combining Monte Carlo Tree Search with online PPO distillation outperforms GRPO and Best-of-N sampling on combinatorial reasoning tasks, achieving 11.3% vs. 8.4% on the Countdown benchmark with Qwen-2.5-1.5B. The key insight: MCTS generates higher-quality training trajectories by searching over reasoning steps rather than individual tokens, and the distilled model runs at standard inference cost.
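The core idea — run MCTS over candidate reasoning steps (not tokens) at training time, then keep the best path as a distillation trajectory — can be sketched as follows. This is a minimal illustration, not the paper's code: the toy `propose_steps` and `reward` functions, the UCB constant, and the bit-string environment are all assumptions for demonstration.

```python
import math

def ucb(parent, child, c=1.4):
    # Standard UCB1 score; unvisited children are explored first.
    if child["visits"] == 0:
        return float("inf")
    exploit = child["value"] / child["visits"]
    explore = c * math.sqrt(math.log(parent["visits"]) / child["visits"])
    return exploit + explore

def mcts_trajectory(root_state, propose_steps, reward, n_sims=200):
    """Step-level MCTS: each tree node is a whole reasoning step,
    not a single token. Returns the most-visited root-to-leaf path,
    which would serve as a training trajectory for distillation."""
    root = {"state": root_state, "visits": 0, "value": 0.0, "children": []}
    for _ in range(n_sims):
        node, path = root, [root]
        # Selection: descend by UCB until reaching a leaf.
        while node["children"]:
            parent = node
            node = max(node["children"], key=lambda ch: ucb(parent, ch))
            path.append(node)
        # Expansion: one child per candidate next reasoning step.
        for nxt in propose_steps(node["state"]):
            node["children"].append(
                {"state": nxt, "visits": 0, "value": 0.0, "children": []})
        # Evaluation + backpropagation of the leaf's reward.
        r = reward(node["state"])
        for n in path:
            n["visits"] += 1
            n["value"] += r
    # Greedy extraction: the most-visited path is the trajectory to distill.
    traj, node = [root["state"]], root
    while node["children"]:
        node = max(node["children"], key=lambda ch: ch["visits"])
        traj.append(node["state"])
    return traj

# Toy run: states are bit strings, reward favors strings dense in '1's.
traj = mcts_trajectory(
    "",
    lambda s: [s + d for d in "01"],
    lambda s: s.count("1") / max(len(s), 1),
    n_sims=200,
)
```

In the actual method, `propose_steps` would be samples from the policy model and `reward` a verifier on the final answer; the distillation loss then pushes the model toward the extracted trajectories, so no search is needed at inference.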
// ANALYSIS
Using MCTS as a training-time trajectory generator — not a runtime search — is an underexplored middle ground between test-time compute scaling and pure RL, and these results suggest it's worth taking seriously.
- MCTS-distilled model hits 11.3% mean@16 vs. 8.4% for CISPO baseline and 7.7% for Best-of-N (N=64) — a meaningful gap on a small 1.5B model
- Searching over reasoning steps rather than tokens is architecturally cleaner and more compute-efficient than token-level tree search
- The distilled model incurs no extra inference cost — MCTS overhead is training-only, addressing the main practical objection to tree search methods
- Parallel MCTS with virtual losses adds diversity to training trajectories, which may explain why GRPO (with its simpler sampling) leaves performance on the table
- Key open question: whether gains persist at scale — the author explicitly flags this as a first experiment and notes the results may be small-model phenomena
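The two evaluation styles compared above measure different things, which matters when reading the numbers. A minimal sketch, assuming the standard definitions (the source does not spell them out): mean@k averages success over k independent samples, while Best-of-N credits a problem as solved if any of N samples succeeds.

```python
def mean_at_k(successes):
    """mean@k: average success rate over k independent samples (0/1 each)."""
    return sum(successes) / len(successes)

def best_of_n(successes):
    """Best-of-N: solved if at least one of N samples succeeds."""
    return 1.0 if any(successes) else 0.0

# Example: 1 success out of 4 samples on one problem.
samples = [0, 1, 0, 0]
print(mean_at_k(samples))  # 0.25
print(best_of_n(samples))  # 1.0
```

Because Best-of-N only needs one lucky sample, a mean@16 score beating a Best-of-64 score (11.3% vs. 7.7%) is a stronger result than the raw gap suggests.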
// TAGS
llm · reasoning · research · fine-tuning · open-source
DISCOVERED
2026-03-15 (28d ago)
PUBLISHED
2026-03-15 (28d ago)
RELEVANCE
7 / 10
AUTHOR
at2005