OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Autoresearch Finds RTX 5090 Sweet Spot
This Reddit post documents the hard-won path to making Karpathy’s Autoresearch behave on an RTX 5090/Blackwell setup. The stable recipe was to avoid the broken full-model compile path, keep the fused optimizer gains, use SDPA/cuDNN attention, and settle on a smaller total batch with a longer training budget.
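The post's `TOTAL_BATCH_SIZE = 2**17` is denominated in tokens, so the effective total batch is reached via gradient accumulation. A minimal sketch of that arithmetic; the per-device micro-batch `B` and sequence length `T` here are illustrative assumptions, not the poster's exact values:

```python
# Gradient-accumulation arithmetic for a token-denominated total batch.
# TOTAL_BATCH_SIZE matches the post; B and T are assumed for illustration.
TOTAL_BATCH_SIZE = 2**17   # 131,072 tokens consumed per optimizer step
B = 32                     # per-device micro-batch in sequences (assumed)
T = 1024                   # sequence length in tokens (assumed)

tokens_per_micro_step = B * T  # 32,768 tokens per forward/backward pass
assert TOTAL_BATCH_SIZE % tokens_per_micro_step == 0, "total batch must divide evenly"

grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_micro_step
print(grad_accum_steps)  # 4 micro-steps accumulated per optimizer update
```

Shrinking the total batch (as the post did) reduces `grad_accum_steps`, giving more optimizer updates per wall-clock second at the cost of noisier gradients.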
// ANALYSIS
This reads less like a triumphant benchmark and more like a reminder that on bleeding-edge GPUs, “runs” and “runs well” are worlds apart.
- The biggest early trap was a technically valid compile path that was catastrophically slow; it made MFU look better or worse depending on the denominator used, which obscured the real issue.
- Pushing per-device batch sizes higher backfired, while `TOTAL_BATCH_SIZE = 2**17` proved a better operating point than either larger or smaller settings.
- The win came from stacking several practical fixes: fused optimizer compilation where it helped, stable SDPA/cuDNN attention, and a longer `TIME_BUDGET = 1200` once the batch regime had stabilized.
- Automation mattered as much as model tuning; the benchmark/extract/strategize/rerun loop had its own failure modes around lock cleanup, completion hooks, and dispatch order.
- The result is valuable because it turns a flaky setup into something reproducible enough to support real follow-on experiments, not just one-off hero runs.
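The MFU ambiguity in the first bullet comes down to which peak-FLOPs denominator you divide by. A minimal sketch using the common 6N-FLOPs-per-token estimate; the parameter count, throughput, and both peak figures are illustrative assumptions, not measured RTX 5090 numbers:

```python
def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs utilization: achieved training FLOPs over hardware peak.

    Uses the standard ~6*N FLOPs-per-token approximation for a dense
    transformer's forward+backward pass.
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# Same run scored against two different denominators (all values assumed):
tps = 50_000   # observed tokens/s (hypothetical)
n = 124e6      # 124M-parameter model (hypothetical)

print(mfu(tps, n, 200e12))  # ≈ 0.186 against an assumed 200 TFLOPS dense peak
print(mfu(tps, n, 400e12))  # ≈ 0.093 against an assumed 400 TFLOPS sparse peak
```

The throughput is identical in both cases; only the denominator changed, which is how the slow-but-valid compile path could look deceptively fine or terrible on paper.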
// TAGS
autoresearch · benchmark · gpu · automation · open-source · llm · agent
DISCOVERED
2026-03-20
PUBLISHED
2026-03-20
RELEVANCE
8/10
AUTHOR
Delicious_Rule_438