Autoresearch lets an AI agent run autonomous ML experiments while you sleep. You write instructions in a program.md file, and the agent modifies training code, runs 5-minute experiments, measures results, and keeps only the improvements, all automatically. Expect ~12 experiments/hour, or ~100 overnight.
*Sources: karpathy/autoresearch on GitHub | uditgoenka/autoresearch (general-purpose fork) | autoresearch skill on LobeHub | Deep dive by Ken Huang | Ole Lehmann: AutoResearch for Skills*
## How It Works
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  program.md  │───▶│   AI Agent   │───▶│   train.py   │
│ (you write)  │    │   (Claude/   │    │ (agent edits)│
│              │    │    Codex)    │    │              │
└──────────────┘    └──────┬───────┘    └──────┬───────┘
                           │                   │
                           │            ┌──────▼───────┐
                           │            │  5-min train │
                           │            │  + validate  │
                           │            └──────┬───────┘
                           │                   │
                    ┌──────▼───────┐    ┌──────▼───────┐
                    │   Better?    │◀───│   val_bpb    │
                    │ Yes → commit │    │  (lower =    │
                    │ No → revert  │    │   better)    │
                    └──────┬───────┘    └──────────────┘
                           │
                     Loop repeats all night
```
The key insight: you don't touch Python. You "program" the agent by writing program.md, high-level instructions about what to explore. The agent does all the code modifications.
## The Three Files
| File | Who edits | Purpose |
|---|---|---|
| `program.md` | You | Research direction, strategy, constraints for the agent |
| `train.py` | Agent | GPT model, optimizer, hyperparameters; everything is fair game |
| `prepare.py` | Nobody | Fixed constants, dataset prep, tokenizer; don't touch |
## Setup
### Requirements
- Single NVIDIA GPU (tested on H100) or Apple Silicon Mac (via MLX fork)
- Python 3.10+
- `uv` package manager
### Installation
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# Install dependencies
uv sync

# Prepare data and tokenizer (~2 minutes, one-time)
uv run prepare.py

# Test with a single training run (~5 minutes)
uv run train.py
```
## Kick Off Experiments
Open Claude Code in the repo directory and say:
> Have a look at program.md and let's kick off a new experiment!
Claude reads your instructions, modifies train.py, runs training, evaluates the result, and loops.
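In pseudocode, the keep-or-revert decision the agent makes each round looks roughly like this. This is an illustrative sketch, not code from the repo: it assumes train.py prints a line such as `val_bpb: 0.8123` (the exact output format is an assumption), and it discards rejected edits with `git checkout` (the agent may use a different mechanism).

```python
import subprocess

def run_training() -> float:
    """Run one 5-minute experiment and parse the validation metric."""
    out = subprocess.run(
        ["uv", "run", "train.py"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("val_bpb:"):  # assumed output format
            return float(line.split(":", 1)[1])
    raise RuntimeError("val_bpb not found in training output")

def keep_or_discard(best_bpb: float) -> float:
    """Commit the current edit to train.py if it helps, else discard it."""
    score = run_training()
    if score < best_bpb:  # lower bits per byte is better
        subprocess.run(
            ["git", "commit", "-am", f"val_bpb {score:.4f}"], check=True
        )
        return score
    subprocess.run(["git", "checkout", "--", "."], check=True)  # drop the edit
    return best_bpb
```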
## Writing program.md
This is where you guide the agent. Think of it as programming in English instead of Python:
```markdown
# Research Program

## Current Goal
Explore whether a mixture-of-experts architecture improves
val_bpb on our tiny language model.

## Constraints
- Keep total parameter count under 50M
- Each experiment must complete in the 5-minute window
- Don't change the tokenizer or dataset

## Strategy
1. Start with a simple 2-expert MoE layer replacing the FFN
2. If that helps, try 4 experts with top-2 routing
3. Experiment with expert capacity factor

## What NOT to do
- Don't increase context length (won't fit in 5 minutes)
- Don't try distributed training
```
## Real-World Use Cases
### Hyperparameter Search
Let the agent systematically explore learning rates, batch sizes, warmup schedules, and weight decay overnight.
### Architecture Prototyping
Test variations on attention mechanisms, FFN designs, normalization strategies, or positional encodings, all within 5-minute experiments.
### Overnight Exploration
Go to sleep with a research question, wake up with 100 data points and a clear winner committed to git.
### Reproducible Research
Every experiment is a git commit with the exact code that produced it. Failures are reverted. Your git log becomes your experiment log.
## Important Things to Know
### Fixed 5-Minute Budget
Every experiment runs for exactly 5 minutes regardless of hardware. This ensures fair comparison across architectural changes and enables ~12 experiments/hour.
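As a sketch of what a fixed wall-clock budget means in code (the function names here are invented for illustration, not taken from train.py):

```python
import time

BUDGET_SECONDS = 5 * 60  # the same wall-clock window for every experiment

def run_with_budget(step_fn, eval_fn, budget=BUDGET_SECONDS):
    """Run training steps until the budget expires, then evaluate.

    step_fn: callable that performs one training step.
    eval_fn: callable returning the comparison metric (val_bpb, lower is better).
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget:
        step_fn()
        steps += 1
    return eval_fn(), steps
```

Because time, not step count, is fixed, a change that slows each step down simply gets fewer steps: speed and quality improvements compete on equal footing.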
### Git Is Your Lab Notebook
- Improvement → `git commit` (change is kept)
- No improvement → `git revert` (change is discarded)
- Your git history becomes a clean record of what worked
### You Program in Markdown, Not Python
The whole point is that you stay at the strategy level. Let the agent handle implementation details. If you find yourself editing train.py directly, you're doing it wrong.
### Works with Claude Code
The original project was designed for any LLM agent, but Claude Code is a natural fit: just open a session in the repo, point Claude to program.md, and let it run. Combine with /loop for extra automation.
### Community Forks
- Apple Silicon (MLX): autoresearch-mlx (no PyTorch required)
- Multi-agent: autoautoresearch (parallel agent swarms)
## Beyond ML: AutoResearch for General Software Engineering
Karpathy's AutoResearch was designed for ML experiments, but the core loop (modify → verify → keep/discard → repeat) works for anything that can be scored. uditgoenka/autoresearch adapts this pattern to general software engineering tasks in Claude Code.
### The General-Purpose Loop
```
Define scoring criteria (3-6 yes/no rubric items)
        ↓
Agent makes a small change
        ↓
Agent tests the result against criteria
        ↓
Better? → keep. Worse? → discard.
        ↓
Repeat forever. Go to sleep.
```
The key: you only need to define a scoring rubric. Not code, not architecture, just a checklist of yes/no criteria. Say "run autoresearch on my landing page skill" and it runs the whole process.
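To make the rubric idea concrete, here is one way to express yes/no criteria as code. The fork works from a plain-language checklist; this minimal sketch, with invented example criteria, just shows the scoring logic:

```python
# Hypothetical rubric for a landing page; each item is a yes/no check.
RUBRIC = [
    ("headline contains a specific number",
     lambda page: any(ch.isdigit() for ch in page["headline"])),
    ("copy avoids buzzwords",
     lambda page: not any(w in page["copy"].lower()
                          for w in ("revolutionary", "synergy"))),
    ("opening sentence is under 15 words",
     lambda page: len(page["copy"].split(".")[0].split()) < 15),
]

def score(page: dict) -> int:
    """Count how many rubric items the page passes."""
    return sum(1 for _, check in RUBRIC if check(page))

def keep(candidate: dict, incumbent: dict) -> bool:
    """Keep a change only if it strictly improves the rubric score."""
    return score(candidate) > score(incumbent)
```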
### Real-World Results
| Use Case | Before | After | Rounds |
|---|---|---|---|
| Page load speed | 1,100ms | 67ms | 67 |
| Landing page copy accuracy | 56% | 92% | auto |
| Test coverage | 70% | 95% | auto |
| Bundle size | bloated | <200KB | auto |
| Lighthouse score | low | 90+ | auto |
### Applicable Scenarios
- Performance optimization: Lighthouse scores, page load times, Core Web Vitals
- Bundle size reduction: keep restructuring code until the frontend bundle is under target
- Code quality: raise unit test coverage from 70% to 95%
- CI/CD security: add security scanning to the pipeline, iterate until no vulnerabilities remain
- Content optimization: cold emails, newsletter headers, landing page copy
- Claude Code Skills: auto-optimize any skill with measurable scoring criteria
### How to Apply to Claude Skills
One practitioner turned this into a reusable pattern (ref):
- Put the AutoResearch logic into a Claude Code skill
- Point it at any other skill you want to optimize
- Define 3-6 yes/no scoring criteria, e.g.:
- Does the headline contain specific numbers or results?
- Is the copy free of marketing buzzwords ("revolutionary", "synergy")?
- Does the CTA directly address a pain point?
- Is the opening sentence under 15 words?
- Say: "run autoresearch on my landing page skill"
- The agent loops: modify skill → test against criteria → keep/discard
The changelog becomes the most valuable output: it records what works and what doesn't for that specific skill. When better models come out, hand the changelog to the new agent and it continues optimizing from where the last one left off.
## How LearnAI Team Could Use This
This pattern teaches students a fundamental concept: anything with a measurable outcome can be optimized automatically. The human's job is defining what "good" means (the rubric), not doing the optimization manually. This connects directly to:
- Formal verification: defining correctness criteria, then automating verification
- TDD: tests as automated success criteria
- The Claude Certified Architect exam's emphasis on verification loops
## Who Made This
Open-sourced in March 2026 by Andrej Karpathy (former Tesla AI Director and OpenAI co-founder). The project's vision: frontier AI research will increasingly be conducted by "autonomous swarms of AI agents" rather than individual human researchers.