Position Bias Benchmark exposes LLM judges

// 45d ago | BENCHMARK RESULT

LLM Position Bias Benchmark tests whether judge models keep the same preference when two similar story variants are shown in swapped order. Across 193 verified pairs and 27 judge models, the median model flipped its underlying choice in 44.8% of decisive cases, with GPT-5.4 high reasoning showing the strongest first-position bias.
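The metrics in the summary can be illustrated with a minimal sketch. The definitions here are assumptions, not the benchmark's confirmed methodology: a pair counts as "decisive" when the judge picks a side in both orderings, a "flip" means the underlying story chosen changes when the order is swapped, and `PairResult` and `position_bias_stats` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PairResult:
    """Judge picks for one story pair (hypothetical record layout).

    Each pick is the displayed position chosen: "first", "second",
    or None when the judge declared a tie or gave no verdict.
    """
    pick_ab: Optional[str]  # verdict when story A is shown first
    pick_ba: Optional[str]  # verdict when story B is shown first


def position_bias_stats(results: list[PairResult]) -> Optional[dict[str, float]]:
    # Assumption: a pair is decisive only if both orderings got a verdict.
    decisive = [r for r in results
                if r.pick_ab is not None and r.pick_ba is not None]
    if not decisive:
        return None

    flips = 0
    first_picks = 0
    for r in decisive:
        # Map the chosen position back to the underlying story.
        winner_ab = "A" if r.pick_ab == "first" else "B"
        winner_ba = "B" if r.pick_ba == "first" else "A"
        if winner_ab != winner_ba:
            flips += 1
        first_picks += (r.pick_ab == "first") + (r.pick_ba == "first")

    return {
        "decisive_coverage": len(decisive) / len(results),
        "flip_rate": flips / len(decisive),
        # Two verdicts per decisive pair, one per ordering.
        "first_shown_rate": first_picks / (2 * len(decisive)),
    }
```

Under these assumptions, a judge with no position bias would show a first-shown rate near 50% and a flip rate near zero on pairs where it has a real preference.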

// ANALYSIS

This is a sharp reminder that LLM-as-judge pipelines can look objective while quietly measuring prompt layout.

  • The benchmark isolates a practical eval failure: pairwise judges often choose the first displayed answer even when the same pair is reversed.
  • GPT-5.4 high reasoning is the warning case here, with 82.3% first-shown picks and a 66.3% order-flip rate.
  • ByteDance Seed2.0 Pro and DeepSeek V3.2 look comparatively cleaner, while Xiaomi MiMo V2 Pro’s low flip rate comes with much lower decisive coverage.
  • For developers running evals, single-pass pairwise judging should be treated as contaminated unless answer order is randomized, counterbalanced, or aggregated across both swaps.
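The mitigation in the last bullet can be sketched as a counterbalanced wrapper that queries the judge in both orders and only accepts a verdict when the two agree. This is an illustrative sketch, not the benchmark's code: the `Judge` callable signature and the name `counterbalanced_verdict` are assumptions, and a real pipeline would replace the toy judges with model calls.

```python
from typing import Callable, Optional

# Hypothetical judge interface: receives (first_text, second_text) and
# returns "first", "second", or None for a tie or missing verdict.
Judge = Callable[[str, str], Optional[str]]


def counterbalanced_verdict(judge: Judge, a: str, b: str) -> Optional[str]:
    """Return "A" or "B" only if the judge prefers the same story in
    both display orders; return None when the orderings disagree or
    either ordering yields no verdict."""
    pick_ab = judge(a, b)  # A displayed first
    pick_ba = judge(b, a)  # B displayed first

    # Translate position picks into the underlying story label.
    winner_ab = {"first": "A", "second": "B"}.get(pick_ab)
    winner_ba = {"first": "B", "second": "A"}.get(pick_ba)

    if winner_ab is not None and winner_ab == winner_ba:
        return winner_ab
    return None


# Toy judges for demonstration only.
def always_first(first: str, second: str) -> Optional[str]:
    # Pure position bias: ignores content entirely.
    return "first"


def longer_wins(first: str, second: str) -> Optional[str]:
    # Content-based preference that does not depend on position.
    return "first" if len(first) > len(second) else "second"
```

With this wrapper, `always_first` never produces an accepted verdict, while `longer_wins` returns the same story regardless of order. Discarding disagreements trades coverage for reliability; counting them as ties is an alternative design.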

// TAGS
llm-position-bias-benchmark, llm, benchmark, testing, research, safety

DISCOVERED

45d ago

2026-04-21

PUBLISHED

45d ago

2026-04-21

RELEVANCE

8/10

AUTHOR

zero0_one1