Autoresearch: 100 Autonomous ML Experiments Overnight

Autoresearch lets an AI agent run autonomous ML experiments while you sleep. You write instructions in a program.md file, and the agent modifies training code, runs 5-minute experiments, measures results, and keeps only the improvements — automatically. Expect ~12 experiments/hour, or ~100 overnight.

*Sources: karpathy/autoresearch on GitHub; uditgoenka/autoresearch (general-purpose fork); autoresearch skill on LobeHub; deep dive by Ken Huang; Ole Lehmann: AutoResearch for Skills*

How It Works

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ program.md   │────▶│  AI Agent    │────▶│  train.py    │
│ (you write)  │     │ (Claude/     │     │ (agent edits)│
│              │     │  Codex)      │     │              │
└──────────────┘     └──────┬───────┘     └──────┬───────┘
                            │                    │
                            │             ┌──────▼───────┐
                            │             │ 5-min train  │
                            │             │ + validate   │
                            │             └──────┬───────┘
                            │                    │
                     ┌──────▼───────┐     ┌──────▼───────┐
                     │ Better?      │◀────│ val_bpb      │
                     │ Yes → commit │     │ (lower =     │
                     │ No → revert  │     │  better)     │
                     └──────┬───────┘     └──────────────┘
                            │
                     Loop repeats all night

The key insight: you don't touch Python. You "program" the agent by writing program.md — high-level instructions about what to explore. The agent does all the code modifications.
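In Python terms, the agent's commit-or-revert loop might look like the minimal sketch below. The `val_bpb:` output format and the commit-message wording are illustrative assumptions, not the repo's actual conventions.

```python
import subprocess

def parse_val_bpb(train_output: str) -> float:
    """Pull the final validation bits-per-byte from train.py's stdout.

    Assumes train.py prints a line like "val_bpb: 1.234" (illustrative).
    """
    for line in reversed(train_output.splitlines()):
        if line.startswith("val_bpb:"):
            return float(line.split(":", 1)[1])
    raise ValueError("no val_bpb line found in training output")

def keep_or_revert(new_bpb: float, best_bpb: float) -> float:
    """Commit train.py if the metric improved, otherwise discard the edit."""
    if new_bpb < best_bpb:  # lower val_bpb is better
        subprocess.run(["git", "commit", "-am", f"val_bpb {new_bpb:.4f}"], check=True)
        return new_bpb
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)
    return best_bpb
```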

The Three Files

File        Who edits  Purpose
program.md  You        Research direction, strategy, constraints for the agent
train.py    Agent      GPT model, optimizer, hyperparameters — everything is fair game
prepare.py  Nobody     Fixed constants, dataset prep, tokenizer — don't touch

Setup

Requirements

  • Single NVIDIA GPU (tested on H100) or Apple Silicon Mac (via MLX fork)
  • Python 3.10+
  • uv package manager

Installation

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# Install dependencies
uv sync

# Prepare data and tokenizer (~2 minutes, one-time)
uv run prepare.py

# Test with a single training run (~5 minutes)
uv run train.py

Kick Off Experiments

Open Claude Code in the repo directory and say:

> Have a look at program.md and let's kick off a new experiment!

Claude reads your instructions, modifies train.py, runs training, evaluates the result, and loops.

Writing program.md

This is where you guide the agent. Think of it as programming in English instead of Python:

# Research Program

## Current Goal
Explore whether a mixture-of-experts architecture improves
val_bpb on our tiny language model.

## Constraints
- Keep total parameter count under 50M
- Each experiment must complete in the 5-minute window
- Don't change the tokenizer or dataset

## Strategy
1. Start with a simple 2-expert MoE layer replacing the FFN
2. If that helps, try 4 experts with top-2 routing
3. Experiment with expert capacity factor

## What NOT to do
- Don't increase context length (won't fit in 5 minutes)
- Don't try distributed training

Real-World Use Cases

Hyperparameter Sweeps

Let the agent systematically explore learning rates, batch sizes, warmup schedules, and weight decay overnight.

Architecture Prototyping

Test variations on attention mechanisms, FFN designs, normalization strategies, or positional encodings — all within 5-minute experiments.

Overnight Exploration

Go to sleep with a research question, wake up with 100 data points and a clear winner committed to git.

Reproducible Research

Every experiment is a git commit with the exact code that produced it. Failures are reverted. Your git log becomes your experiment log.
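Under the illustrative assumption that each commit message embeds its metric (e.g. `a1b2c3d val_bpb 1.2345`), reading the night's results back out of `git log --oneline` is a few lines of Python:

```python
def parse_history(oneline_log: str) -> list[tuple[str, float]]:
    """Extract (commit_hash, val_bpb) pairs from `git log --oneline` text.

    The "val_bpb <number>" commit-message convention is an assumption,
    not the repo's documented format.
    """
    runs = []
    for line in oneline_log.splitlines():
        parts = line.split()
        if "val_bpb" in parts:
            runs.append((parts[0], float(parts[parts.index("val_bpb") + 1])))
    return runs

def best_run(oneline_log: str) -> tuple[str, float]:
    """Return the commit with the lowest (best) val_bpb."""
    return min(parse_history(oneline_log), key=lambda run: run[1])
```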

Important Things to Know

Fixed 5-Minute Budget

Every experiment runs for exactly 5 minutes regardless of hardware. This ensures fair comparison across architectural changes and enables ~12 experiments/hour.
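A minimal sketch of how such a wall-clock budget could be enforced inside a training loop (the real train.py may implement the 5-minute window differently):

```python
import time

def train_with_budget(step_fn, eval_fn, budget_s: float = 5 * 60):
    """Run training steps until the wall-clock budget expires, then evaluate.

    step_fn performs one optimizer step; eval_fn returns the final metric
    (e.g. val_bpb). Both are placeholders for illustration.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()
        steps += 1
    return steps, eval_fn()
```

Because the budget is wall-clock time, a faster architecture simply gets more steps, which is exactly what makes the comparison fair.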

Git Is Your Lab Notebook

  • Improvement → git commit (change is kept)
  • No improvement → git revert (change is discarded)
  • Your git history becomes a clean record of what worked

You Program in Markdown, Not Python

The whole point is that you stay at the strategy level. Let the agent handle implementation details. If you find yourself editing train.py directly, you’re doing it wrong.

Works with Claude Code

The original project was designed for any LLM agent, but Claude Code is a natural fit — just open a session in the repo, point Claude to program.md, and let it run. Combine with /loop for extra automation.

Community Forks

Beyond ML: AutoResearch for General Software Engineering

Karpathy's AutoResearch was designed for ML experiments, but the core loop — modify → verify → keep/discard → repeat — works for anything that can be scored. uditgoenka/autoresearch adapts this pattern to general software engineering tasks in Claude Code.

The General-Purpose Loop

Define scoring criteria (3-6 yes/no rubric items)
     ↓
Agent makes a small change
     ↓
Agent tests the result against criteria
     ↓
Better? → keep. Worse? → discard.
     ↓
Repeat forever. Go to sleep.

The key: you only need to define a scoring rubric. Not code, not architecture — just a checklist of yes/no criteria. Say "run autoresearch on my landing page skill" and it runs the whole process.
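As a sketch, the whole general-purpose loop reduces to two small functions: a rubric is just a list of yes/no checks, and the score is how many pass. All names here are illustrative.

```python
from typing import Callable

# A rubric: each check answers one yes/no question about the artifact.
Rubric = list[Callable[[str], bool]]

def score(artifact: str, rubric: Rubric) -> int:
    """Count how many yes/no criteria the artifact satisfies."""
    return sum(1 for check in rubric if check(artifact))

def keep_if_better(candidate: str, current: str, rubric: Rubric) -> str:
    """Keep the candidate only if it strictly improves the rubric score."""
    return candidate if score(candidate, rubric) > score(current, rubric) else current
```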

Real-World Results

Use Case                    Before   After   Rounds
Page load speed             1,100ms  67ms    67
Landing page copy accuracy  56%      92%     auto
Test coverage               70%      95%     auto
Bundle size                 bloated  <200KB  auto
Lighthouse score            low      90+     auto

Applicable Scenarios

  • Performance optimization — Lighthouse scores, page load times, Core Web Vitals
  • Bundle size reduction — keep restructuring code until the frontend bundle is under target
  • Code quality — raise unit test coverage from 70% to 95%
  • CI/CD security — add security scanning to the pipeline, iterate until no vulnerabilities
  • Content optimization — cold emails, newsletter headers, landing page copy
  • Claude Code Skills — auto-optimize any skill with measurable scoring criteria

How to Apply to Claude Skills

One practitioner turned this into a reusable pattern (ref):

  1. Put the AutoResearch logic into a Claude Code skill
  2. Point it at any other skill you want to optimize
  3. Define 3-6 yes/no scoring criteria, e.g.:
    • Does the headline contain specific numbers or results?
    • Is the copy free of marketing buzzwords ("revolutionary", "synergy")?
    • Does the CTA directly address a pain point?
    • Is the opening sentence under 15 words?
  4. Say: run autoresearch on my landing page skill
  5. The agent loops: modify skill → test against criteria → keep/discard
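The example criteria above translate directly into yes/no checks. The buzzword list and thresholds below are illustrative assumptions, not a fixed spec:

```python
import re

# Hypothetical buzzword list; extend to taste.
BUZZWORDS = {"revolutionary", "synergy", "game-changing"}

def headline_has_numbers(headline: str) -> bool:
    """Does the headline contain specific numbers or results?"""
    return bool(re.search(r"\d", headline))

def copy_is_buzzword_free(copy: str) -> bool:
    """Is the copy free of marketing buzzwords?"""
    words = {w.strip(".,!?\"'").lower() for w in copy.split()}
    return BUZZWORDS.isdisjoint(words)

def opening_under_15_words(copy: str) -> bool:
    """Is the opening sentence under 15 words?"""
    first_sentence = re.split(r"[.!?]", copy, maxsplit=1)[0]
    return len(first_sentence.split()) < 15
```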

The changelog becomes the most valuable output — it records what works and what doesn't for that specific skill. When better models come out, hand the changelog to the new agent and it continues optimizing from where the last one left off.

How LearnAI Team Could Use This

This pattern teaches students a fundamental concept: anything with a measurable outcome can be optimized automatically. The human's job is defining what "good" means (the rubric), not doing the optimization manually.

Who Made This

Open-sourced in March 2026 by Andrej Karpathy (former Tesla AI Director and OpenAI co-founder). The project's vision: frontier AI research will increasingly be conducted by "autonomous swarms of AI agents" rather than individual human researchers.