
Factory benchmark crowns GPT-5.2 for reviews

Factory benchmarked 13 models on 50 real pull requests and found GPT-5.2 and Claude Opus 4.6 at the top, with GPT-5.2 winning on cost-performance. The bigger takeaway is that cheaper models are close enough to make multi-pass review economically compelling.

// ANALYSIS

This is less a frontier-model coronation than a pricing reality check: better code review is about calibration and throughput, not just bigger token bills. Factory’s results suggest most teams should optimize for ensembles and repeated passes, not a single expensive “best” model.
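As a concrete illustration of the multi-pass idea, here is a minimal Python sketch, assuming a hypothetical `call_model(model, diff)` helper that runs one review pass and returns a set of findings; the model name, pass count, and quorum are illustrative, not Factory's actual setup:

```python
from collections import Counter

def call_model(model: str, diff: str) -> set[str]:
    """Placeholder for one review pass: a real version would call
    the provider's API and parse findings out of the response."""
    raise NotImplementedError

def multi_pass_review(diff: str, model: str = "kimi-k2.5",
                      passes: int = 5, quorum: int = 3) -> set[str]:
    """Run the same cheap model several times on one PR diff and
    keep only the findings a majority of passes agree on. One-off
    hallucinated findings rarely survive the quorum, which is how
    repeated cheap passes buy back precision."""
    votes: Counter[str] = Counter()
    for _ in range(passes):
        votes.update(call_model(model, diff))
    return {finding for finding, count in votes.items() if count >= quorum}
```

If a budget model really sits near 80% of frontier quality per pass, several passes at a fraction of the price is exactly the economics the summary points at.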

  • GPT-5.2 edged Opus 4.6 on F1 while costing far less per PR, so the best model is not the most expensive one.
  • Open-source and budget models like Kimi K2.5 and MiniMax M2.7 reached roughly 75-86% of frontier quality at a fraction of the cost.
  • The benchmark is reasonably grounded: 50 real PRs, five open-source repos, identical prompts, repeated runs, and a human-curated golden set.
  • Token spend did not map cleanly to quality; some models burned millions of tokens per PR without improving review results.
  • For Droid users, the practical win is model routing: pick cheaper models for broad coverage and reserve frontier models for hard cases (sketched after this list).
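The headline F1 numbers are simple to reproduce in principle: precision and recall of a model's findings against the human-curated golden set. A minimal scorer, assuming findings are already normalized to comparable strings (a real harness would need fuzzier matching):

```python
def review_f1(found: set[str], golden: set[str]) -> float:
    """F1 = 2PR / (P + R), with precision P = TP / |found|
    and recall R = TP / |golden| over review findings."""
    true_pos = len(found & golden)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(found)
    recall = true_pos / len(golden)
    return 2 * precision * recall / (precision + recall)
```

And the routing idea from the last bullet reduces to a cheap default plus an escalation rule. The hardness heuristic (raw diff size) and model names below are assumptions for illustration, reusing the hypothetical `call_model` helper from the earlier sketch:

```python
CHEAP_MODEL = "minimax-m2.7"   # assumed budget model for broad coverage
FRONTIER_MODEL = "gpt-5.2"     # assumed frontier model for hard cases

def route_review(diff: str, hard_threshold: int = 400) -> set[str]:
    """Send small diffs to the budget model and escalate large
    ones (a stand-in for any real hardness signal: security-
    sensitive paths, low confidence, etc.) to the frontier model."""
    changed_lines = diff.count("\n")
    model = FRONTIER_MODEL if changed_lines > hard_threshold else CHEAP_MODEL
    return call_model(model, diff)
```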
// TAGS
factory · code-review · benchmark · pricing · llm · open-source · ai-coding

DISCOVERED

2026-04-29

PUBLISHED

2026-04-29

RELEVANCE

9/10

AUTHOR

FactoryAI