GPT-5.4 tops Matt Maher's planning benchmark

// 119d ago · BENCHMARK RESULT



OpenAI's GPT-5.4 Thinking model achieved a record 95% score on Matt Maher's planning benchmark, outperforming all current models in complex requirement preservation. Optimized for long-horizon agentic workflows, the model features native computer-use capabilities and a 1M token context window.

// ANALYSIS

GPT-5.4 is the first model to handle long-horizon planning convincingly, but the real story is the finding that execution context significantly boosts planning quality.

– Achieved 95% on the Maher benchmark, setting a new bar for PRD requirement preservation.
– Features "interruptible thinking," which lets developers steer the model's reasoning mid-generation without a full restart.
– A native computer-use success rate of 75.0% on OSWorld marks the first time an AI has officially beaten the human baseline.
– The "Something Stranger" finding suggests that tool mode (execution vs. planning) matters as much as the model's raw capability.
– Full MCP support and a 1M-token context window make it the clear choice for complex, tool-heavy agentic systems.

// TAGS

gpt-5-4 · llm · reasoning · benchmark · agent · mcp

DISCOVERED

119d ago

2026-03-16

PUBLISHED

119d ago

2026-03-16

RELEVANCE

9/10

AUTHOR

Matt Maher
