LamBench turns lambda calculus into coding test
LamBench is a 120-task benchmark that asks models to solve pure lambda-calculus programming problems in the Lamb language. Its first leaderboard already shows wide separation across frontier models, with GPT-5.4 at the top and several systems dropping to zero on some task families.
It is a sharp idea, and also a reminder that new benchmarks are often most useful before models get tuned to them. This one rewards symbolic reasoning, syntactic discipline, and exact execution more than familiar code-generation muscle.
- The task set is unusually deep for a niche benchmark: Church and Scott encodings, lists, trees, ADTs, plus harder algorithms like SAT, FFT, Sudoku, and TSP.
- Scoring is a straightforward pass rate, with solution size tracked as a secondary metric, which makes the results easy to read and harder to hand-wave.
- The current leaderboard is the real story: GPT-5.4 leads GPT-5.5, and the gap to mid-tier and open models is large, which suggests the benchmark is stress-testing a very specific skill mix.
- Because the benchmark is still new and intentionally simple, the scores should be treated as a directional signal, not a broad verdict on general coding ability.
- For teams working on reasoning-heavy agents or compiler-like synthesis, this is a useful addition to the eval stack; for ordinary app coding, it is probably too synthetic to stand alone.
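
For readers unfamiliar with the Church encodings named in the task list, here is a minimal sketch of the idea using Python lambdas. It is not Lamb syntax and not taken from the benchmark; every name (`zero`, `succ`, `add`, `mul`, `is_zero`, `to_int`, `to_bool`) is a hypothetical choice made for illustration.

```python
# Illustrative only: Python lambdas standing in for pure lambda terms.
# This is NOT Lamb syntax and NOT a LamBench task; all names are hypothetical.

# Church numerals: the number n is "apply f to x, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul = lambda m: lambda n: lambda f: m(n(f))

# Church booleans: a boolean selects one of two arguments.
true = lambda a: lambda b: a
false = lambda a: lambda b: b

# is_zero applies "always false" n times to true; zero applications leave true.
is_zero = lambda n: n(lambda _: false)(true)


def to_int(n):
    """Decode a Church numeral by counting applications of +1 starting at 0."""
    return n(lambda k: k + 1)(0)


def to_bool(b):
    """Decode a Church boolean by letting it pick between True and False."""
    return b(True)(False)


one = succ(zero)
two = succ(one)
three = add(one)(two)
six = mul(two)(three)

print(to_int(three), to_int(six))                        # 3 6
print(to_bool(is_zero(zero)), to_bool(is_zero(two)))     # True False
```

The decoders `to_int` and `to_bool` exist only to inspect results from the host language. In a pure lambda-calculus setting, a solution would have to stay inside the encoding, which is part of why exact execution matters more here than in ordinary code generation.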
DISCOVERED: 2026-04-24
PUBLISHED: 2026-04-24
AUTHOR: uniVocity