OPEN_SOURCE
REDDIT · 27d ago · BENCHMARK RESULT
EsoLang-Bench exposes frontier models' reasoning limits
Researchers built a coding benchmark using esoteric languages (Brainfuck, Befunge-98, Whitespace) to separate genuine reasoning from training-data memorization. Across GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2, the best result was 11% — and every model scored 0% on anything above Easy difficulty.
// ANALYSIS
EsoLang-Bench is a methodological gut-punch to the benchmarking status quo: if your model "solves" code problems in languages it has never meaningfully seen, that's genuine reasoning. By that test, none of them can.
- Models scoring 85–95% on HumanEval collapse to 0–11% on equivalent problems in esoteric languages, exposing how much benchmark performance is memorization
- The failure modes are telling: in Brainfuck (some training data), models produce valid syntax but fail the logic; in Whitespace (almost none), they can't even produce syntactically valid programs
- Agentic systems (Claude Code, Codex) score 2–3x better, but the gains come from tighter feedback loops and context management, not anything resembling genuine reasoning transfer
- Few-shot prompting gave +0.8 percentage points on average, which is statistical noise, showing the technique depends on existing training knowledge
- –The paper calls for more OOD evaluations where gaming has no economic incentive, which is a compelling design principle for the entire benchmarking community
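The Brainfuck failure mode above (valid syntax, wrong logic) is easy to check mechanically, because Brainfuck's only syntax rule is balanced `[` `]` brackets; everything else is semantics. A minimal reference interpreter sketches the distinction (this is an illustrative sketch, not the paper's actual evaluation harness):

```python
def run_bf(program: str, input_bytes: bytes = b"") -> bytes:
    """Interpret a Brainfuck program and return its output bytes."""
    # Precompute matching brackets. Unbalanced brackets are the only
    # way a Brainfuck program can be syntactically invalid.
    jumps, stack = {}, []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * 30000      # zero-initialized cell tape
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(program):
        ch = program[pc]
        if ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == ".":
            out.append(tape[ptr])
        elif ch == ",":
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]   # skip loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]   # jump back to loop start
        pc += 1
    return bytes(out)

# A correct program: 8 * 8 + 1 = 65, prints "A".
print(run_bf("++++++++[>++++++++<-]>+."))   # b'A'
# Syntactically valid but logically wrong for the same task:
# 8 * 7 + 1 = 57, prints "9" instead of "A".
print(run_bf("++++++++[>+++++++<-]>+."))    # b'9'
```

Both programs pass the syntax check; only semantic evaluation separates them, which is the gap the benchmark measures.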
// TAGS
esolang-bench · benchmark · llm · reasoning · research
DISCOVERED
2026-03-16
PUBLISHED
2026-03-15
RELEVANCE
8/10
AUTHOR
ShoddyIndependent883