OPEN_SOURCE
REDDIT · 14d ago · BENCHMARK RESULT
LawBreaker benchmark catches LLMs breaking physics laws
LawBreaker is an open-source benchmark that procedurally generates adversarial physics questions and grades model answers with symbolic math, not an LLM judge. Its first leaderboard shows even strong Gemini variants slipping on unit handling, formula traps, and Bernoulli’s Equation.
// ANALYSIS
This is the kind of benchmark LLM evals need more often: procedurally generated, hard to memorize, and graded with deterministic math instead of vibes.
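A minimal sketch of what "graded with deterministic math" can look like in practice (this is an illustration, not LawBreaker's actual grader): parse the model's final expression and the reference answer with SymPy and test algebraic equivalence, so two differently arranged but equal formulas both pass and no LLM judge is involved.

```python
import sympy as sp

def grade(model_answer: str, reference: str) -> bool:
    """Return True iff the two expressions are algebraically equivalent.

    Symbols (v, g, h, ...) are created automatically by sympify; equivalence
    is decided by simplifying the difference to zero, not by string match.
    """
    got = sp.sympify(model_answer)
    want = sp.sympify(reference)
    return sp.simplify(got - want) == 0
```

With this scheme, `grade("v**2/2 + g*h", "g*h + v**2/2")` passes while a sign error like `"v**2/2 - g*h"` fails deterministically, which is the property that makes results comparable across leaderboard runs.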
- The trap design is the point here, and it’s well-targeted: anchoring bias, unit confusion, and missing constants map directly to common model failure modes.
- The spread between Gemini models is more interesting than the absolute scores; a flash-image preview outperforming the pro preview suggests robustness is highly task-specific.
- Bernoulli’s Equation being a universal weak spot is a useful signal, not just a funny result: pressure/unit normalization and multi-step reasoning still look brittle.
- Auto-pushing results to Hugging Face makes the benchmark easy to compare over time, which should help separate real progress from prompt-specific luck.
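To make the Bernoulli/unit-normalization point concrete, here is a hypothetical example of the kind of trap described above (not a question from the benchmark): pressures stated in kPa must be converted to Pa before the terms of Bernoulli's equation can be combined, and skipping that factor of 1000 silently corrupts the answer.

```python
RHO = 1000.0  # water density, kg/m^3 (assumed working fluid)
G = 9.81      # gravitational acceleration, m/s^2

def outlet_speed(p1_kpa: float, p2_kpa: float,
                 v1: float, h1: float, h2: float) -> float:
    """Solve Bernoulli's equation for the outlet speed v2.

    p1 + rho*v1^2/2 + rho*g*h1 = p2 + rho*v2^2/2 + rho*g*h2
    The trap: inputs arrive in kPa and must become Pa before use.
    """
    p1, p2 = p1_kpa * 1e3, p2_kpa * 1e3  # the step models reportedly fumble
    v2_sq = v1**2 + 2 * (p1 - p2) / RHO + 2 * G * (h1 - h2)
    return v2_sq ** 0.5
```

A symbolic grader catches the omitted conversion immediately, because the wrong answer differs from the reference by a clean factor of sqrt(1000) rather than by a plausible rounding error.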
// TAGS
lawbreaker · llm · benchmark · research · testing · open-source
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-29
RELEVANCE
8/10
AUTHOR
pacman-s-install