OPEN_SOURCE
REDDIT · 36d ago · BENCHMARK RESULT
CarWashBench flunks most frontier models
CarWashBench v0.1 is a tiny public benchmark built around harder variants of the classic car-wash trick question, designed to test whether LLMs can reason past surface cues. Across eight frontier models and five runs per question, only Gemini 3.1 Pro and GLM 5.0 showed meaningful performance; most of the remaining models scored 0%.
// ANALYSIS
Tiny benchmarks can be noisy, but this one is a sharp gut-check for whether flagship "reasoning" models are actually reasoning or just pattern-matching.
- The benchmark is extremely small at just two questions, so the leaderboard is more signal flare than final verdict.
- Even so, near-total failure from multiple top-tier models is notable because the task targets everyday common-sense reasoning rather than specialized knowledge.
- Gemini 3.1 Pro standing well above the field suggests some models are better at escaping superficial heuristics on deceptively simple prompts.
- If the author expands the question set without losing the adversarial framing, this could become a useful lightweight reasoning stress test.
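The scoring setup described above (each model answering each question over several independent runs) can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness; the model names, results grid, and `score` helper are all hypothetical.

```python
# Hypothetical results grid: model -> question -> per-run pass/fail booleans.
# Five runs per question, mirroring the benchmark's setup; data is illustrative.
results = {
    "model_a": {"q1": [True, True, True, True, True],
                "q2": [True, False, True, True, True]},
    "model_b": {"q1": [False] * 5,
                "q2": [False] * 5},
}

def score(results):
    """Mean per-run accuracy per model, as a percentage across all questions."""
    scores = {}
    for model, questions in results.items():
        runs = [ok for run_list in questions.values() for ok in run_list]
        scores[model] = 100.0 * sum(runs) / len(runs)
    return scores

print(score(results))  # {'model_a': 90.0, 'model_b': 0.0}
```

Averaging over runs rather than taking a single sample is what makes a 0% score meaningful even on a two-question benchmark: a model that fails all ten attempts is reliably failing, not just unlucky.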
// TAGS
carwashbench · llm · benchmark · reasoning · research
DISCOVERED
36d ago
2026-03-07
PUBLISHED
36d ago
2026-03-07
RELEVANCE
7/10
AUTHOR
Eyelbee