Factory benchmark crowns GPT-5.2 for reviews

// 45d ago · BENCHMARK RESULT

Factory benchmarked 13 models on 50 real pull requests and found GPT-5.2 and Claude Opus 4.6 at the top, with GPT-5.2 winning on cost-performance. The bigger takeaway is that cheaper models are close enough to make multi-pass review economically compelling.

// ANALYSIS

This is less a frontier-model coronation than a pricing reality check: better code review is about calibration and throughput, not just bigger token bills. Factory’s results suggest most teams should optimize for ensembles and repeated passes, not a single expensive “best” model.
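The "ensembles and repeated passes" idea can be sketched as a simple voting scheme: run several cheap review passes over the same diff and keep only the findings that enough passes agree on. This is a minimal illustration, not Factory's pipeline; the `Finding` shape, the `Reviewer` callable, `multi_pass_review`, and the stub passes are all hypothetical names, and the stubs stand in for real model calls.

```python
from collections import Counter
from typing import Callable, Iterable

# Assumption: a finding is identified by (file path, line number, issue category).
Finding = tuple[str, int, str]
# Hypothetical reviewer interface: takes a diff, returns the findings it reports.
Reviewer = Callable[[str], Iterable[Finding]]


def multi_pass_review(
    diff: str, reviewers: list[Reviewer], min_votes: int = 2
) -> list[Finding]:
    """Run every pass and keep findings reported by at least `min_votes` passes."""
    votes: Counter[Finding] = Counter()
    for review in reviewers:
        # Deduplicate within a pass so a single pass cannot vote twice.
        votes.update(set(review(diff)))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Stub passes with made-up findings, standing in for real model calls.
def pass_a(diff: str) -> list[Finding]:
    return [("app.py", 10, "null-deref"), ("app.py", 42, "off-by-one")]


def pass_b(diff: str) -> list[Finding]:
    return [("app.py", 10, "null-deref"), ("db.py", 7, "sql-injection")]


def pass_c(diff: str) -> list[Finding]:
    return [("app.py", 10, "null-deref"), ("db.py", 7, "sql-injection")]


consensus = multi_pass_review("<diff text>", [pass_a, pass_b, pass_c])
for finding in consensus:
    print(finding)
```

Raising `min_votes` trades recall for precision; setting it to 1 takes the union of all passes, which favors coverage over calibration.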

  • GPT-5.2 edged Opus 4.6 on F1 while costing far less per PR, so the best model is not the most expensive one.
  • Open-source and budget models like Kimi K2.5 and MiniMax M2.7 reached roughly 75-86% of frontier quality at a fraction of the cost.
  • The benchmark is reasonably grounded: 50 real PRs, five open-source repos, identical prompts, repeated runs, and a human-curated golden set.
  • Token spend did not map cleanly to quality; some models burned millions of tokens per PR without improving review results.
  • For Droid users, the practical win is model routing: pick cheaper models for broad coverage and reserve frontier models for hard cases.
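
The points above rest on two mechanics: scoring a model's findings against a golden set with F1, and routing PRs between cheap and frontier models. The sketch below shows both under stated assumptions. The finding IDs are made up, and `route_model`, its line threshold, and the "sensitive path" flag are hypothetical heuristics, not anything Factory or Droid documents.

```python
def f1_score(predicted: set[str], golden: set[str]) -> float:
    """F1 of a model's reported findings against a human-curated golden set."""
    true_positives = len(predicted & golden)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of reports that are real
    recall = true_positives / len(golden)        # share of real issues that were found
    return 2 * precision * recall / (precision + recall)


def route_model(
    changed_lines: int, touches_sensitive_path: bool, line_threshold: int = 400
) -> str:
    """Hypothetical routing rule: budget model by default, frontier for hard cases."""
    if touches_sensitive_path or changed_lines > line_threshold:
        return "frontier"
    return "budget"


# Illustrative data only: four golden issues, four reported findings, three in common.
golden = {"issue-1", "issue-2", "issue-3", "issue-4"}
predicted = {"issue-1", "issue-2", "issue-3", "false-alarm"}

score = f1_score(predicted, golden)
print(f"F1 = {score:.2f}")
print(route_model(changed_lines=50, touches_sensitive_path=False))
print(route_model(changed_lines=1200, touches_sensitive_path=False))
```

Dividing F1 by the measured cost per PR gives a rough cost-performance figure for comparing models, provided the costs come from real runs rather than list prices.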
// TAGS
factory, code-review, benchmark, pricing, llm, open-source, ai-coding

DISCOVERED

45d ago

2026-04-29

PUBLISHED

45d ago

2026-04-29

RELEVANCE

9/10

AUTHOR

FactoryAI