LogicGraph targets multi-path reasoning blind spot
LogicGraph is a new benchmark for multi-path logical reasoning that tests whether LLMs can enumerate multiple valid proof routes instead of just landing on one correct answer. The paper introduces a 900-instance, solver-verified dataset with 2-19 valid proof paths per query plus a Prover9-backed evaluation pipeline that exposes how quickly even strong models collapse onto a narrow set of solutions.
LogicGraph matters because it shifts reasoning evals from “got the answer” to “explored the space,” which better matches how real agentic systems fail in practice.
- Each problem comes with an exhaustive set of minimal proofs, making it possible to measure coverage and strategy diversity instead of only final-answer accuracy
- The benchmark bakes in logical distractions and shared intermediate nodes, so models have to reason through competing valid routes rather than follow a single clean chain
- The paper’s results show a sharp gap between convergent success and divergent exploration: top models can often find one proof, but still miss many alternatives as depth increases
- The Prover9-based neuro-symbolic evaluator is a strong contribution on its own, since it checks step validity and proof reachability more rigorously than LLM-as-a-judge setups
- For developers building reasoning agents, this is a useful warning that high benchmark accuracy can still hide brittle search behavior and premature commitment
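To make the gap between "found one proof" and "covered the proof space" concrete, here is a minimal Python sketch. It is not the paper's metric or pipeline: the representation of a proof as a set of rule labels, the function names `proof_coverage` and `found_any`, and the toy data are all assumptions made for illustration, and the step-validity checking that Prover9 performs is not modeled at all.

```python
from typing import FrozenSet, Iterable, Set

# Assumption: a proof is identified by the set of rule labels it uses.
# The benchmark's real proof representation may differ.
Proof = FrozenSet[str]


def proof_coverage(reference: Iterable[Proof], found: Iterable[Proof]) -> float:
    """Fraction of reference proofs that appear among the model's proofs.

    Duplicates and proofs outside the reference set do not raise the score.
    """
    ref: Set[Proof] = set(reference)
    if not ref:
        raise ValueError("reference proof set must not be empty")
    return len(ref & set(found)) / len(ref)


def found_any(reference: Iterable[Proof], found: Iterable[Proof]) -> bool:
    """Convergent success: the model produced at least one reference proof."""
    return proof_coverage(reference, found) > 0.0


# Hypothetical query with four valid minimal proofs.
reference = [
    frozenset({"r1", "r2"}),
    frozenset({"r1", "r3", "r4"}),
    frozenset({"r5"}),
    frozenset({"r2", "r6"}),
]

# Hypothetical model output: one valid proof (repeated) and one invalid route.
model_output = [
    frozenset({"r1", "r2"}),
    frozenset({"r1", "r2"}),
    frozenset({"r9"}),
]

solved = found_any(reference, model_output)
coverage = proof_coverage(reference, model_output)
print(f"solved={solved} coverage={coverage:.2f}")
```

In this toy case the model counts as having solved the query, yet it covers only one of the four reference proofs, which is the kind of mismatch that final-answer accuracy alone would not reveal.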
DISCOVERED: 2026-03-06
PUBLISHED: 2026-03-06
AUTHOR: Discover AI