# Autoresearch

GitHub: [here](https://github.com/karpathy/autoresearch)

Autonomous ML research while you sleep. An [[AI agents|agent]] that modifies training code, runs experiments, evaluates results, and iterates. No human in the loop between cycles.

Karpathy built it as open source, and the design is deliberately minimal. Three files. `prepare.py` holds constants and data prep (untouched). `train.py` is the only file the agent can edit: GPT model, optimizer, training loop. `program.md` is the research agenda in plain Markdown, written by the human.

The agent reads the agenda, modifies `train.py`, trains for exactly 5 minutes, checks validation bits-per-byte, decides whether to keep or revert, and loops. The fixed time budget means roughly 12 experiments per hour, about 100 overnight. One GPU.

The constraint design is the interesting part. By limiting the agent to a single file and a 5-minute window, you get a tight feedback loop with a binary outcome: did the metric improve or not? This is exactly where [[model layer and above]] says AI excels: tightly verifiable domains with binary outcomes. Math, code compilation, games, molecules. And now: experimental ML research. The agent doesn't need taste or judgment. It needs to try things and measure.

## Why This Pattern Matters

This is [[domain specific sense-making]] automated. The traditional research loop (read widely, map ideas, synthesize, remix) still applies, but the "try and measure" part of ML research now runs at machine speed. The human retains the strategic layer: writing `program.md`, deciding what questions to ask. The agent handles execution.

[[Wright’s Law]] should compound here. Each experiment the agent runs generates signal about what works and what doesn't. If the system logs results properly, experiment #100 should be informed by the 99 before it. The learning curve compresses with cumulative volume, same as [[Deployment Velocity]] in industrial AI.
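The keep-or-revert cycle described above can be sketched as a greedy loop over candidate edits. This is a minimal illustration, not the repo's actual implementation: the function names (`autoresearch_loop`, `propose`, `train_and_eval`) are hypothetical, and the real system edits `train.py` and runs a 5-minute training job where this sketch just calls an evaluation function.

```python
def autoresearch_loop(propose, train_and_eval, baseline_bpb, n_cycles):
    """Greedy keep-or-revert research loop.

    propose(i)        -> a candidate edit for cycle i (hypothetical callback)
    train_and_eval(c) -> validation bits-per-byte after applying candidate c
    Lower bits-per-byte is better; anything that doesn't beat the
    current best is reverted (discarded).
    """
    best_bpb = baseline_bpb
    kept = []
    for i in range(n_cycles):
        candidate = propose(i)
        bpb = train_and_eval(candidate)
        if bpb < best_bpb:      # metric improved: keep the edit
            best_bpb = bpb
            kept.append(candidate)
        # else: revert, i.e. simply don't carry the candidate forward
    return best_bpb, kept
```

The design choice worth noting: the loop has no memory beyond the best metric so far, which is what makes each cycle cheap and the outcome binary.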
The open source angle maps to [[AI era Defensibility]]. Karpathy isn't building a product. He's setting a standard for how autonomous research loops get structured. [[Open source is often misunderstood]]: give away the core, let the ecosystem build on top. Every fork (macOS support, Windows, different model architectures) extends distribution without effort.

## The Verification Question

The bottleneck is still verification. Bits-per-byte is clean and unambiguous for language model training. But most research domains don't have a single metric you can evaluate in 5 minutes. Drug discovery, materials science, climate modeling: the eval function itself is the hard problem. [[Domain Experts as Eval Builders]] becomes critical when you try to port this pattern beyond ML training loops.

[[LLM-as-Judge]] offers one path. Use a model to evaluate another model's research output. But for subjective or multi-dimensional research, you're back to the core insight from [[model layer and above]]: the deeper arbitrage is making the subjective scorable. Until you can do that, autoresearch stays confined to domains with fast, binary feedback.

## Where This Sits

[[AI agents]] distinguishes between reactive agents (respond to current stimuli) and learning agents (improve from experience). Autoresearch is a learning agent with an unusually tight loop. Most [[Autonomous Agents]] struggle with the gap between logical reasoning and execution. Here, the execution environment is so constrained (one file, one metric, one GPU) that the gap nearly disappears.

The economics connect to [[The infrastructure layer and AI capex]]. A single GPU running overnight costs almost nothing. The research output, if the agent finds genuine improvements, could be worth orders of magnitude more. That's the leverage: human sets direction, machine explores the space at near-zero marginal cost. [[Sparsity x LLMs]] and inference cost compression only make this cheaper over time.
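For concreteness on why bits-per-byte is such a clean eval: it is just the model's cross-entropy re-expressed per byte of raw text, so a single scalar summarizes the whole validation run. A minimal sketch of the conversion, assuming the loss is a summed negative log-likelihood in nats (the helper name is mine, not from the repo):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed negative log-likelihood (in nats) over a
    validation set into bits per byte: divide by ln(2) to get bits,
    then by the number of raw bytes the text occupies."""
    return total_nll_nats / (total_bytes * math.log(2))
```

A uniform model over 256 byte values scores exactly 8 bits per byte; anything a trained model achieves below that is compression, which is why the metric is both fast to compute and impossible to argue with.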
The bigger pattern: research itself is becoming a compute problem. Not replacing the researcher's judgment, but replacing the researcher's hands. The human writes the agenda. The machine runs the experiments. This is [[PKM System]] logic applied to ML: keep ideas moving, let the system do the repetitive synthesis, focus human attention on the creative leaps.

Links:

- [[AI agents]]
- [[Autonomous Agents]]
- [[model layer and above]]
- [[domain specific sense-making]]
- [[Wright's Law]]
- [[Deployment Velocity]]
- [[AI era Defensibility]]
- [[Open source is often misunderstood]]
- [[Domain Experts as Eval Builders]]
- [[LLM-as-Judge]]
- [[The infrastructure layer and AI capex]]
- [[Sparsity x LLMs]]
- [[PKM System]]
- [[Selling AI MOC]]
- [[Industrial AI MOC]]

---

Tags: #deeptech #systems #kp