GPT-5.5 fails LiveBench agentic coding test
OpenAI's latest model, GPT-5.5, has debuted with a significant regression in agentic coding on the LiveBench benchmark, scoring 56.67 compared to GPT-5.4's 70.00. While marketed as OpenAI's strongest agentic model to date, the "Thinking" variant currently ranks 11th on the contamination-free leaderboard, falling behind both its predecessor and older models like GPT-5.1 Codex.
GPT-5.5's LiveBench result exposes a critical trade-off between task-completion efficiency and novel-logic reasoning in OpenAI's architecture. The 13.33-point drop highlights a struggle with non-contaminated problems, suggesting the model may be over-optimized for existing datasets. While performance on Terminal-Bench 2.0 and SWE-Bench Pro remains high, a 40% reduction in token usage suggests that optimizing for cost and speed may be degrading deep reasoning. Community reports of "lazy" code completions and context drift further align with these benchmark regressions.
DISCOVERED: 2026-04-25
PUBLISHED: 2026-04-25
AUTHOR: Keybug