GPT-5.4 tops Extended Connections benchmark

// 141d ago BENCHMARK RESULT

On Lech Mazur’s Extended NYT Connections benchmark, GPT-5.4 scores 94.0 in extra high mode and 92.0 in medium mode, up from GPT-5.2’s 88.6 and 71.4, respectively, on the same puzzle set. The no-reasoning score rises only modestly, from 28.1 to 32.8, which suggests that most of the gain comes from stronger deliberate reasoning rather than raw pattern matching.
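The per-mode gains implied by those scores can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout, the mode labels used as keys, and the `point_gains` helper are assumptions made for this example and are not part of the benchmark's own tooling.

```python
# Scores as reported in the summary above (Extended NYT Connections benchmark).
# The structure and key names here are hypothetical, chosen for illustration.
scores = {
    "extra high": {"gpt-5.4": 94.0, "gpt-5.2": 88.6},
    "medium": {"gpt-5.4": 92.0, "gpt-5.2": 71.4},
    "no reasoning": {"gpt-5.4": 32.8, "gpt-5.2": 28.1},
}


def point_gains(table):
    """Return the GPT-5.4 minus GPT-5.2 difference per mode, rounded to 0.1."""
    return {
        mode: round(s["gpt-5.4"] - s["gpt-5.2"], 1)
        for mode, s in table.items()
    }


gains = point_gains(scores)
for mode, gain in gains.items():
    print(f"{mode}: +{gain} points")
```

Under these figures the medium-mode gain (20.6 points) is several times larger than the extra high (5.4) or no-reasoning (4.7) gains.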

// ANALYSIS

GPT-5.4 looks meaningfully better on a puzzle-heavy reasoning benchmark, but the split between reasoning and no-reasoning modes is the real story for developers evaluating cost, latency, and capability tradeoffs.

–The medium-mode jump from 71.4 to 92.0 (20.6 points) is the largest of the three modes and suggests OpenAI improved practical reasoning efficiency, not just max-effort performance.
–The benchmark uses 759 NYT Connections puzzles with extra trick words, so it is testing categorization and distractor resistance rather than straight factual recall.
–The weak no-reasoning score relative to reasoning modes reinforces how much structured inference still matters on deceptively simple language tasks.
–This is a strong directional signal, but it is still a niche third-party benchmark rather than a full proxy for coding, agent reliability, or production workloads.

// TAGS

gpt-5-4, llm, reasoning, benchmark, research

DISCOVERED

141d ago

2026-03-06

PUBLISHED

142d ago

2026-03-05

RELEVANCE

8/10

AUTHOR

zero0_one1
