Karpathy's Autoresearch slashes eCLIP mean rank

// 112d ago RESEARCH PAPER

Yogesh Kumar applied Karpathy's autoresearch loop to an old eCLIP research codebase, letting Claude Code iterate on `train.py` inside a locked-down containerized sandbox. In 42 runs over one Saturday, the agent cut validation mean rank by 54%, mostly by fixing a temperature clamp and retuning hyperparameters.
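For readers unfamiliar with the metric, here is a minimal sketch of mean rank as it is commonly defined for retrieval, assuming query i is paired with candidate i. This summary does not spell out the post's exact evaluation code, so the function name and the toy similarity matrix below are illustrative only. Lower is better, and 1.0 means every true match was ranked first.

```python
def mean_rank(similarity):
    """Mean 1-based rank of the true match, assuming query i pairs with candidate i."""
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of candidates scored strictly higher than the true match.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    return sum(ranks) / len(ranks)


# Toy 3x3 similarity matrix (made-up numbers): rows are queries, columns are candidates.
sim = [
    [0.9, 0.1, 0.3],  # true match ranked 1st
    [0.4, 0.2, 0.8],  # true match ranked 3rd
    [0.1, 0.2, 0.7],  # true match ranked 1st
]
print(mean_rank(sim))
```

In this toy case the ranks are 1, 3, and 1, so the mean rank is 5/3.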

// ANALYSIS

This is a strong proof of concept for agentic research, but the real story is scoping: once the task is bounded by a single metric, a single file, and a hard time budget, the agent can do useful work. It looks less like an autonomous scientist and more like a very fast ablation engine that still needs a human to set the question.
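The scoping described above can be sketched as a greedy keep-or-revert loop. This is a hypothetical illustration, not Karpathy's or Kumar's actual code: `bounded_search`, `evaluate`, and `propose` are invented names, and a dictionary of settings stands in for an edited `train.py`. The single metric is whatever `evaluate` returns (lower is better), and the run cap and deadline play the role of the hard budget.

```python
import time


def bounded_search(evaluate, propose, baseline, max_runs, time_budget_s):
    """Greedy keep-or-revert search over candidates scored by one metric (lower is better)."""
    best_config, best_score = baseline, evaluate(baseline)
    history = [(baseline, best_score, True)]
    deadline = time.monotonic() + time_budget_s
    for _ in range(max_runs):
        if time.monotonic() >= deadline:
            break  # hard time budget reached
        candidate = propose(best_config)
        score = evaluate(candidate)
        kept = score < best_score
        if kept:
            # Keep the change; otherwise the next proposal starts from the old best (a revert).
            best_config, best_score = candidate, score
        history.append((candidate, score, kept))
    return best_config, best_score, history


# Toy usage with made-up numbers: the "metric" is the distance of lr from 0.01.
candidates = iter([{"lr": 0.1}, {"lr": 0.02}, {"lr": 0.5}])


def toy_evaluate(cfg):
    return abs(cfg["lr"] - 0.01)


def toy_propose(current):
    return next(candidates)


best, score, log = bounded_search(
    toy_evaluate, toy_propose, {"lr": 1.0}, max_runs=3, time_budget_s=60
)
print(best, [kept for _, _, kept in log])
```

The loop has no notion of which question is worth asking. The human supplies the metric, the starting point, and the budget, which matches the "fast ablation engine" framing.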

– `program.md` is the real control surface, effectively acting as a lightweight operating system for the agent.
– The sandbox and permission lock mattered as much as the model, because they kept the loop safe and reviewable.
– The biggest gain came from a temperature clamp bug fix, which suggests research code still hides plenty of low-hanging fruit.
– Hyperparameter tuning delivered more value than architectural changes, which is exactly the kind of search current agents are good at.
– Once the exploration moved into moonshot ideas, success dropped sharply, showing the ceiling of today’s autonomous research loops.
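For context on the temperature clamp mentioned in the list above: CLIP-style models learn a logit scale (the inverse temperature) in log space and cap it so that similarities are never multiplied by more than a fixed bound. The sketch below only illustrates what such a clamp does. It does not reproduce the bug or the fix from the post, whose details this summary does not give. The cap of 100 follows the original CLIP setup and is an assumption for eCLIP, and `clamped_scale` is a hypothetical helper name.

```python
import math

MAX_SCALE = 100.0  # cap from the original CLIP setup; assumed here, not confirmed for eCLIP


def clamped_scale(log_scale, max_scale=MAX_SCALE):
    """Return exp(log_scale), capped at max_scale by clamping in log space."""
    return math.exp(min(log_scale, math.log(max_scale)))


def scaled_logits(similarities, log_scale):
    """Multiply cosine similarities by the clamped scale before the softmax loss."""
    scale = clamped_scale(log_scale)
    return [scale * s for s in similarities]


print(clamped_scale(math.log(1 / 0.07)))  # a common starting temperature of 0.07
print(clamped_scale(10.0))                # an oversized value is capped near 100
```

Because the learned value lives in log space while the cap is usually quoted in linear space, this is an easy place for a bound to be applied to the wrong quantity. Whether that is what happened in the eCLIP code is not stated here.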

// TAGS

autoresearch agent research ai-coding automation open-source

DISCOVERED

112d ago

2026-03-23

PUBLISHED

112d ago

2026-03-23

RELEVANCE

8/10

AUTHOR

ykumards
