BullshitBench v2 tracks model pushback
Peter Gostev’s open BullshitBench benchmark has expanded to 100 nonsense prompts across five domains and now ships with a live v2 explorer and leaderboard. Instead of measuring raw knowledge, it tests whether models clearly reject broken premises, partially challenge them, or confidently accept nonsense.
BullshitBench matters because it measures a failure mode most benchmark suites barely touch: models that sound smart while endorsing nonsense. That makes it unusually relevant for anyone building AI systems that need judgment, not just fluent output.
- The v2 dataset broadens coverage from the original set to 100 prompts spanning software, finance, legal, medical, and physics
- Its green/amber/red scoring is easy to interpret and maps well to real product risk: push back, hedge, or hallucinate confidently
- The live explorer makes the benchmark more useful than a static paper by letting developers inspect model behavior, domain splits, and leaderboard movement directly
- Early discussion around the release reinforces a key point for AI builders: more reasoning effort does not automatically mean better refusal behavior on bad premises
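To make the green/amber/red scheme concrete, here is a minimal Python sketch of how per-domain results could be tallied into a pushback rate. It is an illustration only: the `summarize` helper, the `(domain, verdict)` record shape, and the sample rows are assumptions made for this example, not BullshitBench's actual schema or data.

```python
from collections import Counter, defaultdict

# Hypothetical labels mirroring the three outcomes described above:
# green = clear pushback, amber = partial challenge, red = accepts the nonsense.
VERDICTS = ("green", "amber", "red")


def summarize(results):
    """Tally verdicts per domain and compute the share of green (clear pushback)."""
    by_domain = defaultdict(Counter)
    for domain, verdict in results:
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        by_domain[domain][verdict] += 1

    summary = {}
    for domain, counts in by_domain.items():
        total = sum(counts.values())
        summary[domain] = {
            "counts": {v: counts[v] for v in VERDICTS},
            "green_rate": counts["green"] / total,
        }
    return summary


# Made-up sample rows for illustration; not real benchmark output.
sample = [
    ("software", "green"),
    ("software", "red"),
    ("medical", "amber"),
    ("medical", "green"),
    ("medical", "green"),
    ("medical", "green"),
]

if __name__ == "__main__":
    for domain, stats in summarize(sample).items():
        print(domain, stats)
```

Reporting the green rate per domain rather than one overall number keeps a model that pushes back well on software prompts but not on medical ones from hiding behind an average.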
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
AUTHOR
Income stream surfers