LawBreaker benchmark catches LLMs breaking physics laws
REDDIT // 14d ago · BENCHMARK RESULT


LawBreaker is an open-source benchmark that procedurally generates adversarial physics questions and grades model answers with symbolic math rather than an LLM judge. Its first leaderboard shows even strong Gemini variants slipping on unit handling, formula traps, and Bernoulli's equation.

// ANALYSIS

This is the kind of benchmark LLM evals need more often: procedurally generated, hard to memorize, and graded with deterministic math instead of vibes.

  • The trap design is the point here, and it’s well-targeted: anchoring bias, unit confusion, and missing constants map directly to common model failure modes.
  • The spread between Gemini models is more interesting than the absolute scores; a flash-image preview outperforming the pro preview suggests robustness is highly task-specific.
  • Bernoulli's equation being a universal weak spot is a useful signal, not just a funny result: pressure/unit normalization and multi-step reasoning still look brittle.
  • Auto-pushing results to Hugging Face makes the benchmark easy to compare over time, which should help separate real progress from prompt-specific luck.
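The deterministic-grading idea above can be sketched with sympy. Everything here (the `grade` function, the solved-for-velocity form of Bernoulli's equation, the example "unit slip") is an illustrative assumption, not LawBreaker's actual code; the point is that symbolic comparison passes algebraically equivalent answers and fails numerically wrong ones, with no judge model in the loop.

```python
# Sketch of deterministic symbolic grading, in the spirit of LawBreaker.
# All names and expressions below are assumptions for illustration.
from sympy import symbols, sqrt

# Bernoulli (constant height): p1 + rho*v1**2/2 = p2 + rho*v2**2/2
# Ground truth, solved symbolically for v2.
p1, p2, rho, v1 = symbols("p1 p2 rho v1", positive=True)
v2_truth = sqrt(v1**2 + 2 * (p1 - p2) / rho)

def grade(model_expr) -> bool:
    """Pass iff the model's expression equals the ground truth symbolically."""
    return (model_expr - v2_truth).equals(0) is True

# A correct answer in a different algebraic form still passes:
candidate = sqrt((rho * v1**2 + 2 * p1 - 2 * p2) / rho)
print(grade(candidate))  # True

# A unit-trap slip (e.g. kPa silently treated as Pa) fails:
wrong = sqrt(v1**2 + 2000 * (p1 - p2) / rho)
print(grade(wrong))  # False
```

Because the check is symbolic equality rather than string matching, the grader is indifferent to how the model rearranges the formula, which is what makes procedurally generated variants hard to game by memorization.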
// TAGS
lawbreaker · llm · benchmark · research · testing · open-source

DISCOVERED

14d ago

2026-03-29

PUBLISHED

14d ago

2026-03-29

RELEVANCE

8 / 10

AUTHOR

pacman-s-install