OPEN_SOURCE
REDDIT · 2h ago

AutoResearch scores 14% gain on transit corpus

A user applied Karpathy’s autoresearch loop to a 33M-token public transit corpus and reported a roughly 14% language-modeling improvement on an 80M-parameter transformer trained from scratch. The post’s main value is methodological: it documents several apparent accuracy wins that failed to replicate, which makes the validation setup more interesting than the raw score.

// ANALYSIS

This reads as a solid small-data methodology report, not a frontier model result. The strongest signal is that autoresearch can still discover useful training changes under tight wall-clock constraints, but only if the evaluation gate is stricter than the metric the agent can see.

Halving the batch size was the key gain because it traded batch stability for 3.6x more optimizer steps inside the same 5-minute budget. The 80M-parameter model appears to be the best fit for this hardware and time budget: larger models ran out of steps, smaller ones ran out of capacity.

The hidden validation gate did real work: two dev-bpb improvements looked good to the agent but failed to generalize to the held-out surface. The replication pass matters more than the headline gain; most domain-accuracy deltas collapsed across seeds, which is exactly what you'd expect from 100-250-item eval sets.

The most useful next step is a DAPT (domain-adaptive pretraining) comparison, because it separates "random init plus search" from the much easier pretrained baseline. The sketches below walk through the gate logic, the step-count arithmetic, and the eval-set noise floor.
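
A minimal sketch of the gate logic, with hypothetical names and illustrative numbers since the post does not publish its harness; the only point carried over from the post is that acceptance keys off a split the agent never observes:

# Minimal sketch of a hidden validation gate for an autoresearch loop.
# The agent optimizes dev bpb, but a change is accepted only if it also
# improves bits-per-byte on a held-out split the agent cannot see.
# All names (Candidate, gate) and all bpb values are hypothetical.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # e.g. "halve batch size"
    dev_bpb: float   # metric the agent is allowed to see
    held_bpb: float  # metric only the gate sees

def gate(baseline_held_bpb: float, candidate: Candidate,
         min_delta: float = 0.0) -> bool:
    """Accept only if the held-out metric improves, regardless of dev gains."""
    return candidate.held_bpb < baseline_held_bpb - min_delta

# A dev-only win is rejected here, which is the failure mode the post
# reports for two of its apparent improvements.
baseline_held = 1.102
overfit_win = Candidate("dev-only tweak", dev_bpb=1.050, held_bpb=1.110)
real_win = Candidate("halve batch size", dev_bpb=1.048, held_bpb=1.071)

assert not gate(baseline_held, overfit_win)
assert gate(baseline_held, real_win)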
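
The batch-size point is just step-count arithmetic. The 5-minute budget and the roughly 3.6x step ratio come from the post; the per-step times below are invented to make the division concrete:

# Step-count arithmetic behind the batch-size win. In a fixed wall-clock
# budget, optimizer steps = budget / time_per_step. Halving the batch can
# cut step time by more than 2x when the larger batch was memory- or
# launch-bound. The per-step times are hypothetical; only the 5-minute
# budget and the ~3.6x ratio come from the post.

BUDGET_S = 5 * 60

def steps(step_time_s: float) -> int:
    return int(BUDGET_S // step_time_s)

full_batch_step = 1.8   # hypothetical seconds/step at the original batch size
half_batch_step = 0.5   # hypothetical seconds/step at half batch size

print(steps(full_batch_step))                            # 166 steps
print(steps(half_batch_step))                            # 600 steps
print(steps(half_batch_step) / steps(full_batch_step))   # ~3.6x more steps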
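
The seed-collapse observation follows from standard binomial statistics, not anything specific to this post. For 100-250-item accuracy sets, the noise floor on a two-run comparison is large enough to swallow most single-digit deltas:

# Why 100-250 item accuracy sets produce unstable deltas: the standard
# error of an accuracy estimate is sqrt(p*(1-p)/n). The p=0.5 baseline
# below is illustrative; the eval-set sizes come from the post.

import math

def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 250):
    se = accuracy_se(0.5, n)
    # Comparing two independent runs, the SE of the difference is
    # sqrt(2)*se; a ~2-sigma criterion puts the noise floor at 2*sqrt(2)*se.
    print(f"n={n}: SE ~ {se:.3f}, two-run noise floor ~ {2 * math.sqrt(2) * se:.3f}")

# n=100: SE ~ 0.050, noise floor ~ 0.141 (14 accuracy points)
# n=250: SE ~ 0.032, noise floor ~ 0.089 (9 accuracy points)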

// TAGS
autoresearch · llm · agent · research · benchmark · open-source

DISCOVERED

2026-04-30 (2h ago)

PUBLISHED

2026-04-30 (3h ago)

RELEVANCE

8/10

AUTHOR

MarsPassenger