This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O'Reilly Radar.
The toolkit pattern is a way of documenting your project's configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool's configuration format, its constraints, and enough worked examples to make that possible. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users will never have to learn how your tool's configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don't have to compromise on the way your project is configured, because the config files can be more complex and more complete than they would be if a human had to edit and understand them.
To understand why all of this matters, let me take you back to the mid-1980s.
I was 12 years old, and our family got an AT&T PC 6300, an IBM-compatible that came with a user's guide roughly 159 pages long. Chapter 4 of that manual was called "What Every User Should Know." It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and genuinely useful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure could damage the magnetic surface.
I remember being fascinated by this manual. It wasn't our first computer. I'd been writing BASIC programs and dialing into BBSs and CompuServe for a few years, so I knew there were all kinds of amazing things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. It seemed really weird to me, even as a kid, that you'd give somebody a manual with a whole page on using the backspace key to correct typing mistakes (really!) but not actually tell them how to use the thing to do anything useful.
That's how most developer documentation works. We write the stuff that's easy to write (installation, setup, the getting-started guide) because it's a lot easier than writing the stuff that's actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another "looking for your keys under the streetlight" problem: We write the documentation we write because it's easiest to write, even if it's not really the documentation our users need.
Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn't already know what you were doing. The tar man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a .tar.gz file, it's almost useless. (The right flags are -xzvf, in case you're curious.) Stack Overflow exists largely because man pages like tar's left a gap between what the documentation said and what developers actually needed to know.
And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you'll actually get useful answers, because these are all established projects that have been written about extensively and the training data is everywhere.
But AI hits a hard wall at the boundary of its training data. If you've built something new (a framework, an internal platform, a tool your team created), no model has ever seen it. Your users can't ask their AI assistant for help, because the AI doesn't know your thing even exists.
There's been a lot of great work moving AI documentation in the right direction. AGENTS.md tells AI coding agents how to work in your codebase, treating the AI as a developer. llms.txt gives models a structured summary of your external documentation, treating the AI as a search engine. What's been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.
The toolkit pattern solves that problem of getting AIs to write configuration files for a project that isn't in their training data. It consists of a documentation file that teaches any AI enough about your project's configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a methodology for doing it well. This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.
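For orientation only, here's the general shape such a file can take. This skeleton is hypothetical, not Octobatch's actual TOOLKIT.md, and the section names are illustrative; yours will and should look different:

```markdown
# TOOLKIT.md (hypothetical skeleton; section names are illustrative)

## What this tool does
One paragraph of orientation, written for the AI rather than the human.

## Configuration format
The building blocks: file types, required fields, and how the files
reference each other.

## Constraints
The rules a valid configuration must follow: one principle each, with
one concrete example.

## Worked examples
Two or three complete, known-good configurations, with a sentence on
what each one demonstrates.

## Common failures
Mistakes observed in real sessions, and the rule that prevents each one.
```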
Build the AI its own manual
Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building Octobatch, the batch-processing orchestrator I've been writing about in this series. As I described in the earlier articles in this series, "The Accidental Orchestrator" and "Keep Deterministic Work Deterministic," Octobatch runs complex multistep LLM pipelines that generate data or run Monte Carlo simulations. Each pipeline is defined using a complex configuration that consists of YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.
As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely useful. When I developed a new feature, I'd work with the AIs to come up with the configuration structure to support it. At first I defined the configuration, but by the end of the project I relied on the AIs to come up with the first cut, and I'd push back when something seemed off or not forward-looking enough. Once we all agreed, I'd have an AI produce the actual updated config for whatever pipeline we were working on. This move to having the AIs do the heavy lifting of writing the configuration was really useful, because it let me create a very robust format very quickly without having to spend hours updating existing configurations every time I changed the syntax or semantics.
At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I'd already worked through with the AIs. The project already had a README.md file, and every time I changed the configuration I had an AI update it to keep the documentation current. But by this time, the README.md file was doing way too much work: It was really comprehensive but a real headache to read. It had eight separate subdocuments showing the user how to do just about everything Octobatch supported, the bulk of it was focused on configuration, and it was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I'd produced documentation that was genuinely painful to read.
Looking back at my chats, I can trace how the toolkit pattern evolved. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:
I'm thinking about how to provide some kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we'd call "Octobatch Studio" where we make it easy to prompt for editing pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give a lot of guidance for creating it.
I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I'd been doing:
My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.
The traditional trade-off between simplicity and flexibility comes from cognitive overhead: the cost of holding all of a system's rules, constraints, and interactions in your head while you work with it. It's why many developers opt for simpler config files, so they don't overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn't the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.
That toolkit-based workflow (users describe what they want, the AI reads TOOLKIT.md and generates the config) is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: "Read pipelines/TOOLKIT.md and use it as your guide." The AI reads the file, understands the project structure, and guides them step by step.
To see what this looks like in practice, take the Drunken Sailor pipeline I described in "The Accidental Orchestrator." It's a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.
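The pipeline itself is configuration-driven, but the underlying simulation is easy to state in plain code. Here's a rough sketch of one trial; the function names, parameters, and step logic are my own illustration, not Octobatch's actual expression steps:

```python
import random

def drunken_sailor_trial(start=0, ship=10, water=-10, rng=None):
    """One random walk: the sailor steps +1 or -1 until reaching
    the ship (success) or falling in the water (failure)."""
    rng = rng or random.Random()
    pos = start
    while water < pos < ship:
        pos += rng.choice((-1, 1))
    return pos == ship

def survival_rate(trials=10_000, seed=42):
    """Monte Carlo estimate: fraction of sailors who reach the ship."""
    rng = random.Random(seed)
    return sum(drunken_sailor_trial(rng=rng) for _ in range(trials)) / trials
```

With an unbiased walk starting midway between the two boundaries, the survival rate converges to about 50%; the pipeline version expresses the same loop as configuration rather than code.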

Here's the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the full configuration by reading TOOLKIT.md. This is the exact prompt I gave Claude Code to generate the Drunken Sailor pipeline; notice the first line of the prompt, telling it to read the toolkit file.

But configuration generation is only half of what the toolkit file does. Users can also add TOOLKIT.md and PROJECT_CONTEXT.md (which has information about the project) to any AI assistant they like (ChatGPT, Gemini, Claude, Copilot, whatever) and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, "What do I do?" and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.

What the Octobatch project taught me about the toolkit pattern
Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create configuration files that worked; it also yielded a set of practices, and those practices turned out to be fairly consistent regardless of what kind of project you're building. Here are the five that mattered most:
- Start with the toolkit file and grow it from failures. Don't wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.
- Let the AI write the config files. Your job is product vision: what the project should do and how it should feel. The AI's job is translating that into valid configuration.
- Keep guidance lean. State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.
- Treat every use as a test. There's no separate testing phase for documentation. Every time somebody uses the toolkit file to build something, that's a test of whether the documentation works.
- Use more than one model. Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.
I'm not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats differ wildly from tool to tool (that's the whole problem we're trying to solve), and a toolkit file that describes your project's building blocks is going to look completely different from one that describes someone else's. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it's writing for another AI. These five practices should help you build an effective toolkit regardless of what your project looks like.
Start with the toolkit file and grow it from failures
You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn't was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format (the structure, the rules, the constraints, the examples, everything we'd talked about) into a single TOOLKIT.md file. That first version wasn't great, but it was a starting point, and every failure after that made it better.
I didn't plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had, by working with an AI, but everything they'd need to do that was spread across months of chat logs and the CONTEXT.md files I'd been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single TOOLKIT.md file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.
That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.
You can do the same thing. If you're starting a new project, you might plan to create the toolkit at the end. But it's more effective to start with a simple version early and let it emerge over the course of development. That way you're dogfooding it the whole time instead of guessing what users will need.
Let the AI write the config files (but stay in control!)
Early Octobatch pipelines had simple enough configuration that a human could read and understand them, but not because I was writing them by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that even though they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.
At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn't have to keep every single line in my head; it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.
Once the project really got rolling, I never wrote YAML by hand again. The cycle was always: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.
This division of labor, however, meant inevitable disagreements between me and the AI, and it's not always easy to find yourself disagreeing with a machine, because they're surprisingly stubborn (and occasionally shockingly stupid). It required patience and vigilance to stay in control of the project, especially when I turned over large tasks to the AIs.
The AIs consistently optimized for technical correctness (separation of concerns, code organization, effort estimation), which was fine, because that's the job I asked them to do. I optimized for product value. I found that keeping that value as my north star and always focusing on building useful features consistently helped with these disagreements.
Keep guidance lean
Once you start growing the toolkit from failures, the natural tendency is to overdocument everything. Generative AIs are biased toward generating, and it's easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you have to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you have to bring is telling them when not to add something.
The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to TOOLKIT.md. The answer was no: the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse, for both you and the AI, is to add a WARNING block. Resist it. One principle, one example, move on.
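Here's what that density looks like on the page. This entry is invented for illustration (the rule and field names aren't Octobatch's), but it shows the target: one principle, one example, no warning blocks:

```markdown
## Step ordering

An expression step may only reference outputs of steps that run before it.

Example: if a step named `roll` produces `position`, a later expression
step can use `roll.position`; a step that runs before `roll` cannot.
```

The bloated version of this entry would add a WARNING block, three edge cases, and a restatement of the rule in different words, and the AI reading it would be no better off.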
Treat every use as a test
There was no separate "testing phase" for Octobatch's TOOLKIT.md. Every pipeline that I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted TOOLKIT.md, and asked it to build a pipeline. The first time I tried it, I was shocked at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.
That's the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn't, the toolkit has a bug.
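Part of "see if the output works" can be automated before you ever run a pipeline. Here's a minimal sketch of the kind of smoke test I mean; it uses stdlib JSON and invented required keys as stand-ins (a real Octobatch pipeline uses YAML and a much richer schema):

```python
import json

# Invented for illustration: stand-ins for whatever your format requires.
REQUIRED_KEYS = {"name", "steps"}

def smoke_test_config(text):
    """Return a list of problems with an AI-generated config;
    an empty list means the config passed the smoke test."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if not isinstance(config.get("steps"), list) or not config.get("steps"):
        problems.append("steps must be a nonempty list")
    return problems
```

Every problem this kind of check catches in a fresh session's output is a candidate for a new principle in the toolkit file.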
Use more than one model
When you're building and testing your toolkit, don't just use one AI. Run the same task through a second model. A good pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.
Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you'll start to get a feel for the different kinds of questions they're good at answering.
When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That's a signal you can't get from using only one model.
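Comparing two models' outputs doesn't have to be eyeballing. Once both configs are parsed into dicts, a simple recursive diff points at exactly which settings the models disagreed on; each diverging path is a toolkit rule worth rereading. A sketch (the sample configs below are invented):

```python
def diverging_keys(a, b, path=""):
    """Recursively compare two parsed configs and return the paths
    where they disagree (missing on one side, or different values)."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        if key not in a or key not in b:
            diffs.append(here)                      # only one model emitted it
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs.extend(diverging_keys(a[key], b[key], here))
        elif a[key] != b[key]:
            diffs.append(here)                      # both emitted it, differently
    return diffs

# Invented example: two models' readings of the same request.
model_one = {"retries": 3, "output": {"format": "json"}}
model_two = {"retries": 3, "output": {"format": "yaml"}, "timeout": 30}
print(diverging_keys(model_one, model_two))  # ['output.format', 'timeout']
```

If `output.format` shows up in the diff, the toolkit's rule about output formats is the one that's ambiguous.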
The manual, revisited
That AT&T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: it described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.
The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project's configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.
If you're building a project and you want AI to be able to help your users, start here: write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model because no single AI catches everything.
The AT&T manual's Chapter 4 was called "What Every User Should Know." Your toolkit file is "What Every AI Should Know." The difference is that this time, the reader will actually use it.
In the next article, I'll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself, and use that to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. It explores an unfamiliar codebase, generates a complete quality infrastructure (tests, review protocols, validation rules), and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it's available as an open source Claude Code skill.
