I arrange an AI agent on a rented GPU, pointed it at a coaching script, and went to mattress. By morning it had run 40 experiments, improved validation loss by 5.9%, and lower reminiscence utilization from 44 GB to 17 GB. It additionally spent 4 hours chasing a bug {that a} linter launched behind its again. The agent by no means flagged it. I solely discovered as a result of the numbers stopped bettering and I began studying logs.
The setup was based mostly on Andrej Karpathy’s autoresearch undertaking: Give an agent one file it may possibly edit (practice.py), one metric to optimize (validation bits per byte), a hard and fast five-minute coaching finances per experiment, and Git for checkpointing. If an experiment beats the present finest, hold the commit. If not, revert. Loop perpetually. Karpathy’s personal run produced 700 experiments and 20 real enhancements throughout 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke pointed the identical sample at Liquid, their templating engine, and acquired 53% quicker rendering from 93 automated commits. The sample clearly works. The query is what breaks while you run it your self.
The primary failure: Brokers fixing brokers
Earlier than working autoresearch, I had a separate drawback. I had 15 customized abilities for Claude Code (assume reusable immediate templates with software entry, structured inputs, and particular behaviors). Most of them have been damaged when dispatched as parallel background brokers. Obscure descriptions meant the system couldn’t work out when to invoke them. Lacking software permissions brought about silent failures. Duplicate scopes between related abilities created routing confusion.
So I used the identical sample: dispatch background brokers in parallel, one per talent, every tasked with studying the talent definition, figuring out issues, and rewriting it. 13 out of 15 got here again improved. Descriptions acquired particular. Useless references to nonexistent information have been eliminated. Software permissions have been added. Two abilities have been left untouched as a result of the brokers couldn’t discover something incorrect with them. The entire batch took beneath an hour.
However right here’s what I didn’t anticipate. Three of the “improved” abilities had delicate regressions. One agent eliminated an AskUserQuestion gate that was there for a purpose, as a result of the gate’s function wasn’t documented and the agent learn it as pointless friction. One other agent rewrote a talent description so exactly that it stopped triggering on the fuzzy, misspelled queries actual customers truly kind. I caught these throughout guide evaluate, but when I had trusted the parallel output with out checking, three abilities would have silently degraded in manufacturing.
The second failure: The linter within the loop
Then I began the coaching loop. The agent labored via hyperparameters methodically. It halved the batch measurement early (experiment 4), which turned out to be the one greatest win: extra gradient steps in the identical five-minute window. It lowered mannequin depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the educational charge schedule. Every change was small. The cumulative impact was a 5.9% enchancment in validation loss and a 60% discount in peak GPU reminiscence.
Out of 40 experiments, the agent saved 9, discarded 28, and crashed three. That hold/discard ratio felt about proper. Most concepts don’t work. The purpose of automation isn’t to have higher concepts. It’s to attempt dangerous ones quicker.
Then the numbers plateaued. Experiments 30 via 38 produced nothing value holding. I began digging via the logs and located one thing I hadn’t anticipated: A linter working on the distant machine had been silently modifying a hyperparameter in practice.py. It modified SCALAR_LR from 0.5 to 0.3 each time the agent saved the file. The agent would set the worth, commit, and run the experiment, however the linter would alter the file between the save and the execution. The agent had no option to detect this as a result of it checked Git diffs, not the runtime state of the file. Each experiment after a sure level was working with a studying charge the agent by no means selected.
I misplaced roughly 4 hours of compute to this. The agent saved going, proposing new concepts, working experiments, logging outcomes. From its perspective nothing was incorrect. The experiments ran, produced numbers, and the numbers have been believable. There was no crash, no error, no alert.
Why this issues past my GPU invoice
Gartner predicts over 40% of agentic AI tasks might be canceled by the top of 2027, citing escalating prices and insufficient threat controls as the first drivers. My in a single day session was a toy instance: a single GPU, a small mannequin, and a low-stakes experiment. However the failure sample scales. An agent that may’t detect when its inputs are being modified between selections will make the identical class of error whether or not it’s tuning hyperparameters or managing a manufacturing pipeline.
The autoresearch constraints are good: one file, one metric, and Git for state. However they assume the surroundings is steady. No one checks whether or not one thing outdoors the loop is modifying the file between commits. The agent optimizes inside its sandbox, and the sandbox has a gap within the wall that no person thought to search for.
Anybody who has run distributed methods acknowledges this. When the linter modified that hyperparameter, it was the equal of somebody enhancing a database report between a learn and a write. We solved that drawback years in the past with compare-and-swap, optimistic locking, checksums. We simply haven’t introduced any of it to autonomous AI workflows. The SkyPilot group lately scaled autoresearch to 16 GPUs and 910 experiments. At that scale, an undetected surroundings mutation doesn’t value you 4 hours. It prices you a cluster.
Subsequent time I run autoresearch, I’ll add a file integrity test earlier than each experiment. It’s three traces of code, however it might have saved me 4 hours and produced a greater remaining outcome. The agent did its job. The surroundings didn’t.
