BullshitBench v2 tracks model pushback

// 82d ago · BENCHMARK RESULT

Peter Gostev’s open BullshitBench benchmark has expanded to 100 nonsense prompts across five domains and now ships with a live v2 explorer and leaderboard. Instead of measuring raw knowledge, it tests whether models clearly reject broken premises, partially challenge them, or confidently accept nonsense.

// ANALYSIS

BullshitBench matters because it measures a failure mode most benchmark suites barely touch: models that sound smart while endorsing nonsense. That makes it unusually relevant for anyone building AI systems that need judgment, not just fluent output.

  • The v2 dataset broadens coverage from the original set to 100 prompts spanning software, finance, legal, medical, and physics
  • Its green/amber/red scoring is easy to interpret and maps well to real product risk: green when the model pushes back, amber when it hedges, red when it confidently accepts the nonsense
  • The live explorer makes the benchmark more useful than a static paper by letting developers inspect model behavior, domain splits, and leaderboard movement directly
  • Early discussion around the release reinforces a key point for AI builders: more reasoning effort does not automatically mean better refusal behavior on bad premises
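
The green/amber/red scheme and the per-domain splits mentioned above can be illustrated with a small tally. This is a minimal sketch only: the row layout (`domain`, `verdict`), the `summarize` helper, and the sample rows are hypothetical and are not taken from BullshitBench's actual data format or code.

```python
from collections import Counter, defaultdict

# Hypothetical verdict labels mirroring the green/amber/red scheme.
GREEN, AMBER, RED = "green", "amber", "red"


def summarize(results):
    """Tally verdicts per domain and compute the share of green (clear pushback)."""
    by_domain = defaultdict(Counter)
    for row in results:
        by_domain[row["domain"]][row["verdict"]] += 1

    summary = {}
    for domain, counts in by_domain.items():
        total = sum(counts.values())
        summary[domain] = {
            "green": counts[GREEN],
            "amber": counts[AMBER],
            "red": counts[RED],
            "green_rate": counts[GREEN] / total,
        }
    return summary


# Made-up sample rows for illustration; not real benchmark results.
sample = [
    {"domain": "software", "verdict": "green"},
    {"domain": "software", "verdict": "red"},
    {"domain": "medical", "verdict": "amber"},
    {"domain": "medical", "verdict": "green"},
    {"domain": "medical", "verdict": "green"},
]

summary = summarize(sample)
for domain, stats in summary.items():
    print(domain, stats)
```

Reporting the three counts alongside the green rate keeps hedged (amber) answers visible instead of folding them into a single pass/fail number.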
// TAGS
bullshitbench, llm, reasoning, benchmark, research, open-source

DISCOVERED: 82d ago (2026-03-06)
PUBLISHED: 82d ago (2026-03-06)
RELEVANCE: 8/10
AUTHOR: Income stream surfers