DeepSWE raises bar for coding benchmarks
DeepSWE is a new benchmark from Datacurve for evaluating frontier coding agents on original, long-horizon software engineering tasks. It focuses on contamination-free tasks written from scratch across 91 repositories and 5 languages, with hand-written verifiers and reference solutions that require substantially more code than older public benchmarks. The release also includes a leaderboard showing clearer separation among top models than saturated benchmarks usually do.
Hot take: this is less about a single score and more about exposing whether coding agents can actually handle real engineering work instead of benchmark-shaped bug fixes.
- Original tasks reduce memorization risk and make the benchmark harder to game.
- The workload is meaningfully larger: prompts are shorter, but solutions are much more extensive and multi-file.
- Hand-written behavioral verifiers should be more trustworthy than checks that reward implementation details.
- The leaderboard suggests frontier models are still separating on harder, longer-horizon work, which is the point of the benchmark.
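To make the verifier point concrete, here is a minimal Python sketch of the difference between a behavioral check and an implementation-coupled one. It is not taken from DeepSWE; the task (order-preserving de-duplication) and every function name below are hypothetical, chosen only for illustration.

```python
# Hypothetical illustration only: none of these names come from DeepSWE.

def dedupe_reference(items):
    """Reference-style solution: drop duplicates, keep first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_alternative(items):
    """A differently written but equally correct solution."""
    return list(dict.fromkeys(items))


def dedupe_broken(items):
    """Wrong: removes duplicates but loses the original order."""
    return sorted(set(items))


def behavioral_verifier(fn):
    """Accepts any implementation whose observable output is correct."""
    cases = [
        ([], []),
        ([1, 1, 2], [1, 2]),
        (["b", "a", "b"], ["b", "a"]),
    ]
    return all(fn(list(inp)) == expected for inp, expected in cases)


def implementation_coupled_check(fn):
    """Brittle: only asks whether the function body refers to `set`."""
    return "set" in fn.__code__.co_names


if __name__ == "__main__":
    for fn in (dedupe_reference, dedupe_alternative, dedupe_broken):
        print(fn.__name__, behavioral_verifier(fn), implementation_coupled_check(fn))
```

In this sketch the behavioral verifier accepts both correct solutions and rejects the broken one, while the implementation-coupled check rejects a correct solution and accepts the broken one, simply because of how each was written.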
DISCOVERED
2026-05-27
PUBLISHED
2026-05-26
AUTHOR
steipete