REDDIT · 5h ago · BENCHMARK RESULT

Position Bias Benchmark exposes LLM judges

The LLM Position Bias Benchmark tests whether judge models keep the same preference when two similar story variants are shown in swapped order. Across 193 verified pairs and 27 judge models, the median model flipped its underlying choice in 44.8% of decisive cases, with GPT-5.4 (high reasoning) showing the strongest first-position bias.
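The order-flip metric described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code; the data shape and function name are assumptions.

```python
# Minimal sketch of an order-flip rate, assuming each record holds the
# judge's pick (named by content, "A"/"B") for both display orders.
# A flip is counted when the content-level preference changes after the
# two variants are shown in swapped order; ties (None) are excluded,
# so the denominator is decisive cases only.

def flip_rate(pairs):
    """pairs: list of (pick_original_order, pick_swapped_order) tuples,
    where each pick names the chosen variant by content, or None for a tie."""
    decisive = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not decisive:
        return 0.0
    flips = sum(1 for a, b in decisive if a != b)
    return flips / len(decisive)

judgments = [("A", "A"), ("A", "B"), ("B", "B"), ("A", None), ("B", "A")]
print(flip_rate(judgments))  # 2 flips out of 4 decisive pairs -> 0.5
```

Restricting the denominator to decisive cases matters: a judge that abstains often can look stable even while flipping most of the verdicts it actually issues.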

// ANALYSIS

This is a sharp reminder that LLM-as-judge pipelines can look objective while quietly measuring prompt layout.

  • The benchmark isolates a practical eval failure: pairwise judges often choose the first displayed answer even when the same pair is reversed.
  • GPT-5.4 (high reasoning) is the warning case here, with 82.3% first-shown picks and a 66.3% order-flip rate.
  • ByteDance Seed2.0 Pro and DeepSeek V3.2 look comparatively cleaner, while Xiaomi MiMo V2 Pro’s low flip rate comes with much lower decisive coverage.
  • For developers running evals, single-pass pairwise judging should be treated as contaminated unless answer order is randomized, counterbalanced, or aggregated across both swaps.
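The aggregation fix in the last bullet can be sketched as a counterbalanced judging wrapper. `judge` is a hypothetical callable (e.g. wrapping an LLM API call) assumed to return "first" or "second" for the pair as displayed; this is a sketch of the counterbalancing idea, not any specific library's API.

```python
# Hedged sketch of counterbalanced pairwise judging: query the judge with
# both display orders, map each positional pick back to the underlying
# answer, and keep a verdict only when both passes agree at the content
# level. An order-driven flip yields None (inconclusive).

def counterbalanced_verdict(judge, answer_a, answer_b):
    """Return the preferred answer if the judge agrees across both
    display orders; return None when the preference flips with order."""
    first_pass = judge(answer_a, answer_b)   # answer_a shown first
    second_pass = judge(answer_b, answer_a)  # answer_b shown first
    pick1 = answer_a if first_pass == "first" else answer_b
    pick2 = answer_b if second_pass == "first" else answer_a
    return pick1 if pick1 == pick2 else None

# Usage with a toy judge that prefers the longer answer (and is
# therefore position-consistent):
longer = lambda x, y: "first" if len(x) >= len(y) else "second"
print(counterbalanced_verdict(longer, "short", "a much longer answer"))
```

Running both orders doubles judging cost, but it converts silent position bias into an explicit "inconclusive" signal that can be tracked or escalated to more passes.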
// TAGS
llm-position-bias-benchmark · llm · benchmark · testing · research · safety

DISCOVERED

5h ago

2026-04-21

PUBLISHED

6h ago

2026-04-21

RELEVANCE

8/10

AUTHOR

zero0_one1