APEX Testing expands leaderboard with recent models

// 45d ago BENCHMARK RESULT

APEX Testing has updated its real-world coding benchmark, which now covers 70 tasks across 8 categories and 59 models and uses Elo-style rankings with multi-judge scoring on actual repos. The site’s leaderboard now includes newer frontier models; a few runs are still incomplete, and some local model entries are slated for addition.

// ANALYSIS

The interesting part here isn’t just the leaderboard shuffle; it’s the benchmark philosophy: real repos, real bugs, real feature work, and a scoring system meant to reward actual engineering behavior instead of demo polish.

– The benchmark is unusually grounded for this space: 70 tasks, 8 categories, and repo-level work that spans frontend, backend, refactoring, debugging, and from-scratch builds
– Elo ratings plus Bradley-Terry/IRT-style adjustments make it feel more like an evolving competition than a static scorecard, which is closer to how agent capability actually changes over time
– The current gaps matter: incomplete runs for some models and planned BF16 additions mean this is a living dataset, not a final verdict
– The project is solo-funded and costly to run, which is both its biggest limitation and its strength: the expense keeps it honest, but coverage will lag behind fast-moving model releases
– For people choosing coding agents, the value is less “who won” and more “which models survive messy, job-like work when the benchmark is not sanitized”
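
For readers unfamiliar with Elo-style ranking, here is a minimal sketch of how head-to-head judgments can turn into a leaderboard. This is the textbook Elo update, not APEX Testing's actual scoring code: the source does not describe its formula, and the Bradley-Terry/IRT-style adjustments are not modeled here. The starting rating of 1500, the K-factor of 32, and the model names are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's solution was judged better, 0.0 if worse,
    and 0.5 for a tie. k (assumed value) controls how far ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rank_models(
    results: Iterable[Tuple[str, str, float]],
    start: float = 1500.0,  # assumed starting rating for unseen models
    k: float = 32.0,
) -> List[Tuple[str, float]]:
    """Process (model_a, model_b, score_a) results in order and return
    models sorted from highest to lowest rating."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in results:
        ra = ratings.get(model_a, start)
        rb = ratings.get(model_b, start)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judged outcomes on individual tasks.
    outcomes = [
        ("model-a", "model-b", 1.0),
        ("model-b", "model-c", 1.0),
        ("model-a", "model-c", 0.5),
    ]
    for name, rating in rank_models(outcomes):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the ranking shifts as new models and new results arrive, which is what makes an Elo-style board behave like a running competition instead of a fixed scorecard.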

// TAGS

apex-testing, benchmark, evaluation, ai-coding, coding-agent, agent, testing, devtool

DISCOVERED

45d ago

2026-05-23

PUBLISHED

45d ago

2026-05-23

RELEVANCE

8/10

AUTHOR

hauhau901
