OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
LLM Racing Games pits models head-to-head
LLM Racing Games is an interactive browser demo comparing how different models build a racing game from the same prompt, then evolve it over a few bug-fix turns. The post is less a polished benchmark than a messy but revealing stress test of model behavior across coding, planning, and browser-tool use.
// ANALYSIS
This is the kind of comparison that’s valuable precisely because it’s imperfect: it exposes not just output quality, but how models behave under iterative, tool-using coding workflows.
- The results read like a qualitative benchmark for agentic coding, not a strict eval, which makes the differences more interesting than a simple score table.
- The post highlights distinct failure modes: regressions, overlong code dumps, broken tool setups, invisible track logic, and one model that only improved after Playwright MCP was accidentally disabled.
- The strongest signal is variance in execution style, not just end-state polish: some models edited incrementally, others rewrote everything, and some leaned into hidden structure or side effects.
- It also shows how much the evaluation setup matters. Vision, browser tooling, and prompt iteration all materially changed outcomes, so apples-to-apples comparisons are only partly achievable.
- As a shareable artifact, it's compelling because people can play the demos themselves and judge the tradeoffs directly rather than trusting a static leaderboard.
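The multi-turn workflow the post describes can be sketched as a simple loop: generate an artifact, evaluate it, feed the bugs back as the next prompt. This is a minimal hypothetical sketch, not the author's actual harness; `call_model` and `find_bugs` are stubs standing in for an LLM API call and a browser-based evaluator (e.g. one driving the game via Playwright).

```python
def call_model(prompt: str, history: list) -> str:
    """Stub standing in for an LLM API call; returns a versioned artifact."""
    return f"<game v{len(history) + 1}>"

def find_bugs(artifact: str) -> list:
    """Stub evaluator; a real harness might load the game in a browser.
    Here we pretend each revision fixes one bug until version 3 is clean."""
    version = int(artifact.split("v")[1].rstrip(">"))
    return [] if version >= 3 else [f"bug-{version}"]

def run_trial(initial_prompt: str, max_turns: int = 5) -> list:
    """Run the build-then-bug-fix loop and return every revision produced."""
    history = []
    prompt = initial_prompt
    for _ in range(max_turns):
        artifact = call_model(prompt, history)
        history.append(artifact)
        bugs = find_bugs(artifact)
        if not bugs:
            break  # evaluator found nothing left to fix
        prompt = f"Fix these bugs: {bugs}"
    return history

versions = run_trial("Build a browser racing game")
print(versions)  # one entry per revision of the game
```

Even in this toy form, the loop surfaces the variance the post cares about: whether a model converges in few turns, regresses, or rewrites everything each iteration shows up directly in the revision history.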
// TAGS
llm · ai-coding · benchmark · agent · computer-use · testing · llm-racing-games
DISCOVERED
4h ago
2026-04-21
PUBLISHED
8h ago
2026-04-21
RELEVANCE
8 / 10
AUTHOR
FatheredPuma81