BullshitBench v2 tracks model pushback
YT · YOUTUBE // 37d ago · BENCHMARK RESULT


Peter Gostev’s open BullshitBench benchmark has expanded to 100 nonsense prompts across five domains and now ships with a live v2 explorer and leaderboard. Rather than measuring raw knowledge, it tests whether models clearly reject broken premises, only partially challenge them, or confidently accept the nonsense.

// ANALYSIS

BullshitBench matters because it measures a failure mode most benchmark suites barely touch: models that sound smart while endorsing nonsense. That makes it unusually relevant for anyone building AI systems that need judgment, not just fluent output.

  • The v2 dataset broadens coverage from the original set to 100 prompts spanning software, finance, legal, medical, and physics
  • Its green/amber/red scoring is easy to interpret and maps well to real product risk: push back, hedge, or hallucinate confidently
  • The live explorer makes the benchmark more useful than a static paper by letting developers inspect model behavior, domain splits, and leaderboard movement directly
  • Early discussion around the release reinforces a key point for AI builders: more reasoning effort does not automatically mean better refusal behavior on bad premises
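The green/amber/red scheme above can be tallied into a single leaderboard-style number. A minimal sketch, assuming partial credit for hedged (amber) answers; the labels, weights, and function names here are illustrative, not BullshitBench's actual scoring code:

```python
# Hypothetical green/amber/red tally; weights are an assumption,
# not taken from the BullshitBench repository.
from collections import Counter

GREEN, AMBER, RED = "green", "amber", "red"  # push back / hedge / accept nonsense
WEIGHTS = {GREEN: 1.0, AMBER: 0.5, RED: 0.0}  # assumed half credit for hedging

def pushback_score(verdicts):
    """Return a 0-1 score from a list of per-prompt verdicts."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return sum(WEIGHTS[v] * n for v, n in counts.items()) / total

# Example: two clear rejections, one hedge, one confident acceptance
print(pushback_score([GREEN, GREEN, AMBER, RED]))  # 2.5 / 4 = 0.625
```

A weighted average like this is one plausible way to rank models while still distinguishing a cautious hedge from a confident hallucination; the live explorer additionally exposes per-domain splits rather than a single aggregate.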
// TAGS
bullshitbench · llm · reasoning · benchmark · research · open-source

DISCOVERED

2026-03-06 (37d ago)

PUBLISHED

2026-03-06 (37d ago)

RELEVANCE

8/10

AUTHOR

Income Stream Surfers