OPEN_SOURCE
YT · YOUTUBE // BENCHMARK RESULT
BullshitBench v2 tracks model pushback
Peter Gostev’s open BullshitBench benchmark has expanded to 100 nonsense prompts across five domains and now ships with a live v2 explorer and leaderboard. Instead of measuring raw knowledge, it tests whether models clearly reject broken premises, partially challenge them, or confidently accept nonsense.
// ANALYSIS
BullshitBench matters because it measures a failure mode most benchmark suites barely touch: models that sound smart while endorsing nonsense. That makes it unusually relevant for anyone building AI systems that need judgment, not just fluent output.
- The v2 dataset broadens coverage from the original set to 100 prompts spanning software, finance, legal, medical, and physics
- Its green/amber/red scoring is easy to interpret and maps well to real product risk: push back, hedge, or hallucinate confidently
- The live explorer makes the benchmark more useful than a static paper by letting developers inspect model behavior, domain splits, and leaderboard movement directly
- Early discussion around the release reinforces a key point for AI builders: more reasoning effort does not automatically mean better refusal behavior on bad premises
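The green/amber/red rubric above can be sketched in code. This is a hypothetical illustration, not the benchmark's actual grader: the `Verdict` enum and the keyword heuristic in `grade_response` are assumptions for demonstration; a real pipeline would use a judge model or human labels rather than string matching.

```python
from enum import Enum

class Verdict(Enum):
    """Hypothetical three-tier rubric mirroring the green/amber/red scoring."""
    GREEN = "rejects the broken premise outright"
    AMBER = "partially challenges the premise"
    RED = "confidently accepts the nonsense"

def grade_response(response: str) -> Verdict:
    # Illustrative keyword heuristic only -- the real benchmark's
    # grading method is not described in this sketch's source.
    text = response.lower()
    if "premise is false" in text or "doesn't exist" in text:
        return Verdict.GREEN
    if "however" in text or "assuming" in text:
        return Verdict.AMBER
    return Verdict.RED

print(grade_response("That API doesn't exist in the standard library.").name)
```

The point of the three-way split is that "amber" hedging is a distinct product risk from confident "red" hallucination, so collapsing them into a single pass/fail score would hide exactly the behavior the benchmark is built to surface.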
// TAGS
bullshitbench · llm · reasoning · benchmark · research · open-source
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
RELEVANCE
8/10
AUTHOR
Income Stream Surfers