APEX Testing expands leaderboard with recent models
APEX Testing has updated its real-world coding benchmark, which now covers 70 tasks in 8 categories and ranks 59 models using ELO-style ratings and multi-judge scoring on actual repos. The site’s leaderboard now reflects newer frontier models, though a few runs are still incomplete and some local model entries are slated for addition.
The interesting part is not the leaderboard shuffle but the benchmark philosophy: real repos, real bugs, real feature work, and a scoring system meant to reward actual engineering behavior instead of demo polish.
- The benchmark is unusually grounded for this space: 70 tasks, 8 categories, and repo-level work that spans frontend, backend, refactoring, debugging, and from-scratch builds
- ELO ratings with Bradley-Terry/IRT-style adjustments make it read more like an evolving competition than a static scorecard, which is closer to how agent capability actually changes over time
- The current gaps matter: incomplete runs for some models and planned BF16 additions mean this is a living dataset, not a final verdict
- Because it is solo-funded and costly to run, the project’s biggest limitation is also its strength: it’s expensive enough to stay honest, but the same cost means coverage will lag behind fast-moving model releases
- For people choosing coding agents, the value is less “who won” and more “which models survive messy, job-like work when the benchmark is not sanitized”
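
To make the ELO-style ranking idea concrete, here is a minimal sketch of a standard rating update from pairwise task outcomes. It is not APEX Testing's actual scoring code: the function names, the starting rating of 1500, and the K-factor of 32 are all assumptions chosen for illustration, and the Bradley-Terry/IRT-style adjustments mentioned above are not modeled.

```python
# Illustrative sketch only: a textbook Elo update, not APEX Testing's pipeline.
# All names and constants below (initial=1500, k=32) are assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won the comparison, 0.5 for a tie, 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_models(results, initial: float = 1500.0, k: float = 32.0):
    """Fold a sequence of (model_a, model_b, score_a) comparisons into ratings."""
    ratings = {}
    for model_a, model_b, score_a in results:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical outcomes: model_a beats model_b, then model_b ties model_c.
    demo = [("model_a", "model_b", 1.0), ("model_b", "model_c", 0.5)]
    for name, rating in sorted(rank_models(demo).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update moves both ratings by the same amount in opposite directions, the total rating pool stays constant, and adding a new model only requires running its comparisons rather than rescoring everything.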
DISCOVERED
2026-05-23
PUBLISHED
2026-05-23
AUTHOR
hauhau901