Leipzig Benchmark evaluates LLM mathematical reasoning
Compiled by 49 mathematicians at the Max Planck Institute, the Leipzig Benchmark is a dataset of 100 research-level mathematics questions designed to evaluate the reasoning capabilities of leading large language models. In multi-run and heavy-thinking evaluations, state-of-the-art models solved 98 percent of the benchmark's questions, showing significant progress in advanced mathematical reasoning.
While LLMs are close to mastering graduate-level mathematical reasoning, their success is highly probabilistic and relies heavily on extended thinking budgets rather than consistent, deterministic understanding.
* Thinking Budgets Trump Model Scale: The drop from 41 unsolved questions in Stage 1 to only 2 in Stage 3 (59 solved versus 98) suggests that giving models extended time to "think" improves reasoning far more than simply scaling model parameters.
* The LLM Inconsistency Problem: As seen in Stage 2, models like Claude Opus 4.7 solved certain questions correctly in only 1 to 3 out of 20 runs, a per-run success rate of 5 to 15 percent on those questions, meaning AI success in research math remains a roll of the dice.
* Math Engines vs. Code Execution: Disabling code execution tools prevented models from attempting fragile brute-force algorithms, forcing them to rely on abstract mathematical reasoning and resulting in better overall outcomes.
* AI Correcting the Experts: The AI-assisted review phase caught 16 errors and typos in the mathematicians' own submissions, demonstrating that LLMs can already act as valuable peer reviewers for human research.
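The gap between single-run and multi-run results can be illustrated with a little arithmetic. The sketch below is not the benchmark's own scoring method. It assumes, for illustration only, that each run on a question is an independent trial with a fixed success probability, and the function names `per_run_rate` and `solve_at_least_once` are hypothetical. Under that assumption, a question solved in only 1 to 3 of 20 runs is unlikely to be solved in any single attempt but fairly likely to be solved at least once across 20 attempts.

```python
def per_run_rate(successes: int, runs: int) -> float:
    """Empirical per-run success rate, e.g. 1 correct answer in 20 runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return successes / runs


def solve_at_least_once(p: float, k: int) -> float:
    """Chance of at least one correct answer in k runs.

    Assumption (not from the source): runs are independent and each
    succeeds with the same probability p.
    """
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    runs = 20
    for successes in (1, 2, 3):
        p = per_run_rate(successes, runs)
        print(
            f"{successes}/{runs} correct: single run {p:.0%}, "
            f"at least once in {runs} runs {solve_at_least_once(p, runs):.0%}"
        )
```

Read this way, a question can count as "solved" in a multi-run evaluation even when any individual attempt fails most of the time, which is why the multi-run totals and the per-run consistency figures tell different stories.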
Discovered: 2026-06-06
Published: 2026-06-06
Author: root-parent