REDDIT // 37d ago // BENCHMARK RESULT
GPT-5.4 tops Extended Connections benchmark
On Lech Mazur’s Extended NYT Connections benchmark, GPT-5.4 scores 94.0 in extra-high reasoning mode and 92.0 in medium, up from GPT-5.2’s 88.6 and 71.4 on the same puzzle set. The no-reasoning score rises only modestly, from 28.1 to 32.8, which suggests that most of the gain comes from stronger deliberate reasoning rather than raw pattern matching.
// ANALYSIS
GPT-5.4 looks meaningfully better on a puzzle-heavy reasoning benchmark, but the split between reasoning and no-reasoning modes is the real story for developers evaluating cost, latency, and capability tradeoffs.
- The medium-mode jump from 71.4 to 92.0 (20.6 points) is far larger than the extra-high gain from 88.6 to 94.0 (5.4 points), which suggests OpenAI improved practical reasoning efficiency, not just max-effort performance.
- The benchmark uses 759 NYT Connections puzzles with extra trick words, so it is testing categorization and distractor resistance rather than straight factual recall.
- The weak no-reasoning score relative to reasoning modes reinforces how much structured inference still matters on deceptively simple language tasks.
- This is a strong directional signal, but it is still a niche third-party benchmark rather than a full proxy for coding, agent reliability, or production workloads.
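For readers who want to check the arithmetic behind these comparisons, here is a minimal sketch that computes the per-mode gain from the scores quoted in the post. The mode labels, the dictionary layout, and the `gains` helper are illustrative assumptions, not part of the benchmark's own tooling.

```python
# Scores as quoted in the post for the Extended NYT Connections benchmark.
# The dictionary layout and key names are hypothetical, chosen for illustration.
scores = {
    "extra high": {"gpt-5.2": 88.6, "gpt-5.4": 94.0},
    "medium": {"gpt-5.2": 71.4, "gpt-5.4": 92.0},
    "no reasoning": {"gpt-5.2": 28.1, "gpt-5.4": 32.8},
}


def gains(table):
    """Return the GPT-5.4 minus GPT-5.2 score difference for each mode."""
    return {
        mode: round(row["gpt-5.4"] - row["gpt-5.2"], 1)
        for mode, row in table.items()
    }


deltas = gains(scores)
for mode, delta in deltas.items():
    print(f"{mode:>12}: +{delta:.1f}")
```

The medium-mode gain (20.6) is roughly four times the extra-high gain (5.4) or the no-reasoning gain (4.7), which is the pattern the analysis above leans on.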
// TAGS
gpt-5-4 · llm · reasoning · benchmark · research
DISCOVERED
37d ago
2026-03-06
PUBLISHED
37d ago
2026-03-05
RELEVANCE
8/10
AUTHOR
zero0_one1