OPEN_SOURCE
REDDIT · 32d ago · BENCHMARK RESULT
GPT-5.4 tops ZeroBench leaderboard
GPT-5.4 now leads ZeroBench, a hard multimodal reasoning benchmark built to stress contemporary vision-language models on near-impossible visual questions. The current leaderboard shows GPT-5.4 (xhigh) at 23% pass@5 and 8% pass^5, ahead of Gemini 3.1 Pro at 19% and 7%.
// ANALYSIS
This is a useful benchmark win because ZeroBench is still brutally hard, so even small gains usually reflect real progress in multimodal reasoning rather than leaderboard noise.
- ZeroBench was introduced as an “impossible” visual benchmark, and frontier models are only now starting to post non-trivial scores
- GPT-5.4 taking the top spot over Gemini 3.1 Pro suggests OpenAI is still highly competitive on image-heavy reasoning, not just text benchmarks
- The absolute scores remain low, which is the bigger story for developers: multimodal reasoning is improving, but it is nowhere near solved
- ZeroBench’s latest site update says its recent v3 wording tweaks did not affect scores, so this looks like a genuine model improvement rather than a benchmark reset
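To make the headline numbers concrete: pass@5 counts a question as solved if any of the five sampled answers is correct, while pass^5 requires all five to be correct, so pass^5 rewards reliability rather than lucky hits. A minimal illustrative sketch (not ZeroBench's official harness; the attempt grid below is hypothetical):

```python
# Each question gets k sampled answers; results is a list of per-question
# lists of booleans (True = that attempt was correct).

def pass_at_k(results):
    # pass@k: fraction of questions where ANY attempt succeeded
    return sum(any(r) for r in results) / len(results)

def pass_pow_k(results):
    # pass^k: fraction of questions where ALL attempts succeeded
    return sum(all(r) for r in results) / len(results)

# Hypothetical grid: 4 questions x 5 attempts each
grid = [
    [True, False, False, False, False],  # solved once -> counts for pass@5 only
    [True, True, True, True, True],      # solved every time -> counts for both
    [False, False, False, False, False], # never solved
    [False, True, True, False, False],   # solved sometimes -> pass@5 only
]

print(pass_at_k(grid))   # 0.75
print(pass_pow_k(grid))  # 0.25
```

The gap between the two columns on the leaderboard (23% vs 8% for GPT-5.4) is exactly this reliability gap: the model can sometimes reach a correct answer it cannot reproduce consistently.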
// TAGS
gpt-5-4 · llm · multimodal · reasoning · benchmark
DISCOVERED
32d ago
2026-03-11
PUBLISHED
33d ago
2026-03-10
RELEVANCE
9/10
AUTHOR
Waiting4AniHaremFDVR