REDDIT · 5h ago · BENCHMARK RESULT

Position Bias Benchmark exposes LLM judges

The LLM Position Bias Benchmark tests whether judge models keep the same preference when two similar story variants are shown in swapped order. Across 193 verified pairs and 27 judge models, the median model flipped its underlying choice in 44.8% of decisive cases, with GPT-5.4 (high reasoning) showing the strongest first-position bias.
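The order-flip metric described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code; the data shape and function name are assumptions.

```python
# Minimal sketch of an order-flip rate, assuming each record holds the
# judge's pick (named by content, "A"/"B") for both display orders.
# A flip is counted when the content-level preference changes after the
# two variants are shown in swapped order; ties (None) are excluded,
# so the denominator is decisive cases only.

def flip_rate(pairs):
    """pairs: list of (pick_original_order, pick_swapped_order) tuples,
    where each pick names the chosen variant by content, or None for a tie."""
    decisive = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not decisive:
        return 0.0
    flips = sum(1 for a, b in decisive if a != b)
    return flips / len(decisive)

judgments = [("A", "A"), ("A", "B"), ("B", "B"), ("A", None), ("B", "A")]
print(flip_rate(judgments))  # 2 flips out of 4 decisive pairs -> 0.5
```

Restricting the denominator to decisive cases matters: a judge that abstains often can look stable even while flipping most of the verdicts it actually issues.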

// ANALYSIS

This is a sharp reminder that LLM-as-judge pipelines can look objective while quietly measuring prompt layout.

  • The benchmark isolates a practical eval failure: pairwise judges often choose the first displayed answer even when the same pair is reversed.
  • GPT-5.4 (high reasoning) is the warning case here, with 82.3% first-shown picks and a 66.3% order-flip rate.
  • ByteDance Seed2.0 Pro and DeepSeek V3.2 look comparatively cleaner, while Xiaomi MiMo V2 Pro’s low flip rate comes with much lower decisive coverage.
  • For developers running evals, single-pass pairwise judging should be treated as contaminated unless answer order is randomized, counterbalanced, or aggregated across both swaps.
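The aggregation fix in the last bullet can be sketched as a counterbalanced judging wrapper. `judge` is a hypothetical callable (e.g. wrapping an LLM API call) assumed to return "first" or "second" for the pair as displayed; this is a sketch of the counterbalancing idea, not any specific library's API.

```python
# Hedged sketch of counterbalanced pairwise judging: query the judge with
# both display orders, map each positional pick back to the underlying
# answer, and keep a verdict only when both passes agree at the content
# level. An order-driven flip yields None (inconclusive).

def counterbalanced_verdict(judge, answer_a, answer_b):
    """Return the preferred answer if the judge agrees across both
    display orders; return None when the preference flips with order."""
    first_pass = judge(answer_a, answer_b)   # answer_a shown first
    second_pass = judge(answer_b, answer_a)  # answer_b shown first
    pick1 = answer_a if first_pass == "first" else answer_b
    pick2 = answer_b if second_pass == "first" else answer_a
    return pick1 if pick1 == pick2 else None

# Usage with a toy judge that prefers the longer answer (and is
# therefore position-consistent):
longer = lambda x, y: "first" if len(x) >= len(y) else "second"
print(counterbalanced_verdict(longer, "short", "a much longer answer"))
```

Running both orders doubles judging cost, but it converts silent position bias into an explicit "inconclusive" signal that can be tracked or escalated to more passes.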
// TAGS
llm-position-bias-benchmark · llm · benchmark · testing · research · safety

DISCOVERED

5h ago

2026-04-21

PUBLISHED

6h ago

2026-04-21

RELEVANCE

8/10

AUTHOR

zero0_one1