CarWashBench flunks most frontier models
REDDIT · 36d ago · BENCHMARK RESULT

CarWashBench v0.1 is a tiny public benchmark built around harder variants of the classic car-wash trick question, designed to test whether LLMs can reason past surface cues. Across eight frontier models, with five runs per question, only Gemini 3.1 Pro and GLM 5.0 showed meaningful performance; most of the others scored 0%.
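The post does not publish the harness, but the scoring protocol it describes (each model answers each question over several independent runs, and accuracy is reported as a percentage over all question/run pairs) can be sketched roughly as follows. All names here (`score`, `model_a`, `model_b`) are illustrative assumptions, not the author's code.

```python
# Hypothetical sketch of CarWashBench-style scoring: each model answers
# each question over several independent runs; accuracy is the share of
# question/run pairs answered correctly, reported as a percentage.
from collections import defaultdict

def score(results, n_questions, n_runs):
    """results: dict mapping (model, question, run) -> bool (correct?).
    Returns model -> accuracy over all question/run pairs, in percent."""
    correct_counts = defaultdict(int)
    for (model, _question, _run), correct in results.items():
        correct_counts[model] += int(correct)
    return {m: 100.0 * c / (n_questions * n_runs)
            for m, c in correct_counts.items()}

# Toy example: 2 questions, 5 runs each, two made-up models.
runs = {}
for q in range(2):
    for r in range(5):
        runs[("model_a", q, r)] = True                 # passes all 10
        runs[("model_b", q, r)] = (q == 0 and r < 2)   # passes 2 of 10
print(score(runs, n_questions=2, n_runs=5))
```

With only two questions and five runs, each question/run pair moves a model's score by 10 points, which is one concrete way to see why such a small benchmark is noisy.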

// ANALYSIS

Tiny benchmarks can be noisy, but this one is a sharp gut-check for whether flagship "reasoning" models are actually reasoning or just pattern-matching.

  • The benchmark is extremely small at just two questions, so the leaderboard is more signal flare than final verdict.
  • Even so, near-total failure from multiple top-tier models is notable because the task targets everyday common-sense reasoning rather than specialized knowledge.
  • Gemini 3.1 Pro standing well above the field suggests some models are better at escaping superficial heuristics on deceptively simple prompts.
  • If the author expands the question set without losing the adversarial framing, this could become a useful lightweight reasoning stress test.
// TAGS
carwashbench · llm · benchmark · reasoning · research

DISCOVERED

36d ago

2026-03-07

PUBLISHED

36d ago

2026-03-07

RELEVANCE

7 / 10

AUTHOR

Eyelbee