Thursday, March 19, 2026

Keep Deterministic Work Deterministic – O’Reilly


This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar.

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
Tom Cargill, Bell Labs

One of the experiments I’ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses those strategy descriptions to decide how to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.

Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer’s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player’s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.

There’s a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it’s the “easy” one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.

Here’s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I’ll wait.

Prompt 1: Track a running “score” through a 7-step game. Don’t use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.

CRITICAL INSTRUCTION: You must respond with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Don’t list the words you counted, don’t explain your reasoning, and don’t write any other text. Just the equation.

Start with a score of 10. I’ll give you the first step in the next prompt.

Prompt 2: “The sudden blizzard chilled the small village communities.” Add the number of words containing double letters (two of the exact same letter back-to-back, like ‘tt’ or ‘mm’).

Prompt 3: “The clever engineer needed seven perfect pieces of cheese.” If your score is ODD, add the number of words that contain EXACTLY two ‘e’s. If your score is EVEN, subtract the number of words that contain EXACTLY two ‘e’s. (Don’t count words with one, three, or zero ‘e’s.)

Prompt 4: “The good sailor joined the eager crew aboard the wooden boat.” If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like ‘ea’, ‘oo’, or ‘oi’). If your score is 10 or less, multiply your score by this number.

Prompt 5: “The quick brown fox jumps over the lazy dog.” Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).

Prompt 6: “Three brave kings stand under black skies.” If your score is an ODD number, subtract the number of words that have exactly five letters. If your score is an EVEN number, multiply your score by the number of words that have exactly five letters.

Prompt 7: “Look down, you shy owl, go fly away.” Subtract the number of words that contain NONE of these letters: a, e, or i.

Prompt 8: “Green apples fall from tall trees.” If your score is greater than 15, subtract the number of words containing the letter ‘a’. If your score is 15 or less, add the number of words containing the letter ‘l’.

The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here’s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).

I ran this twice at the same time (using ChatGPT 5.3 Instant), and got two completely different wrong answers the first time I tried it. Neither run reached the correct score of 60:

Step                      Correct       Run 1 (transcript)   Run 2 (transcript)
1. Double letters         10 + 6 = 16   10 + 2 = 12 ❌       10 + 5 = 15 ❌
2. Exactly two ‘e’s       16 − 4 = 12   12 − 4 = 8 ❌        15 + 4 = 19 ❌
3. Consecutive vowels     12 − 7 = 5    8 × 7 = 56 ❌        19 − 5 = 14 ❌
4. Third letter vowel     5 + 5 = 10    56 + 5 = 61 ❌       14 + 3 = 17 ❌
5. Exactly five letters   10 × 7 = 70   61 − 7 = 54 ❌       17 − 4 = 13 ❌
6. No a, e, or i          70 − 7 = 63   54 − 7 = 47 ❌       13 − 3 = 10 ❌
7. Words with ‘a’ or ‘l’  63 − 3 = 60   47 − 3 = 44 ❌       10 + 4 = 14 ❌

The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn’t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.

Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That’s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.

What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you’d have no way of knowing that Run 1 was a near miss derailed by a single early error and Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.

These failures don’t prove the March of Nines itself; that’s specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It’s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise that you can easily try yourself, one that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters within words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don’t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step’s result determines the next step’s operation, so a single miscount in Step 1 cascades through the entire sequence.

I also want to be clear about exactly what a deterministic version of this simulation looks like. Luckily, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:

Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.

Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it’s producing deterministic logic instead of trying to count characters through its tokenizer.
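Here’s a sketch of the kind of script the model typically produces for Prompt 9. The helper names are mine, not from any transcript, but the logic is nothing more than ordinary string operations applied to the seven hardcoded sentences:

```python
# Deterministic version of the seven-step exercise: plain string
# operations instead of a tokenizer. Ends at 60 on every run.
VOWELS = set("aeiou")

def words(sentence):
    return [w.strip(".,").lower() for w in sentence.split()]

def step(score, n, op):
    new = {"+": score + n, "-": score - n, "*": score * n}[op]
    print(f"{score} {op} {n} = {new}")
    return new

score = 10

# Step 1: words containing double letters.
n = sum(any(a == b for a, b in zip(w, w[1:])) for w in words(
    "The sudden blizzard chilled the small village communities."))
score = step(score, n, "+")

# Step 2: words with exactly two 'e's; odd score adds, even subtracts.
n = sum(w.count("e") == 2 for w in words(
    "The clever engineer needed seven perfect pieces of cheese."))
score = step(score, n, "+" if score % 2 else "-")

# Step 3: words with consecutive vowels; >10 subtracts, else multiplies.
n = sum(any(a in VOWELS and b in VOWELS for a, b in zip(w, w[1:]))
        for w in words("The good sailor joined the eager crew aboard the wooden boat."))
score = step(score, n, "-" if score > 10 else "*")

# Step 4: words whose third letter is a vowel.
n = sum(len(w) > 2 and w[2] in VOWELS for w in words(
    "The quick brown fox jumps over the lazy dog."))
score = step(score, n, "+")

# Step 5: five-letter words; odd score subtracts, even multiplies.
n = sum(len(w) == 5 for w in words("Three brave kings stand under black skies."))
score = step(score, n, "-" if score % 2 else "*")

# Step 6: words containing none of a, e, or i.
n = sum(not set("aei") & set(w) for w in words("Look down, you shy owl, go fly away."))
score = step(score, n, "-")

# Step 7: >15 subtracts words with 'a', else adds words with 'l'.
ws = words("Green apples fall from tall trees.")
if score > 15:
    score = step(score, sum("a" in w for w in ws), "-")
else:
    score = step(score, sum("l" in w for w in ws), "+")

print("Final score:", score)  # 60
```

Every branch and count that trips up the chat model is a one-line expression here, which is the whole point.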

Reproducing a cascading failure in a chat

I deliberately engineered the exercise above to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters within tokens. Future models may do a much better job with this particular kind of failure, but the cascading failure problem doesn’t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn’t.

I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step-by-step (for example, listing the words it counted and explaining why each qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you’ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But “half as many errors” is still not zero. Plus, it’s expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you’re running the AI locally, for orders of magnitude less CPU usage). That’s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.

Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren’t like that. You can’t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline, a reproducible series of steps (some deterministic, some requiring an LLM) that lead to a single result, where each step’s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.

LLM pipelines are especially vulnerable to the March of Nines

I’ve been spending a lot of time thinking about LLM pipelines, and I suspect I’m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step (whether that’s a content generation pipeline, a data processing chain, or a simulation) you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I’ve been running it hundreds of times per iteration.
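To make “ceilings multiply” concrete, here’s the arithmetic. The per-step figures below are round numbers chosen for illustration, not measurements from my pipeline:

```python
# End-to-end reliability of a chained pipeline is the product of the
# per-step reliabilities. Even quite good steps compound into real
# failure rates once the chain gets long.
steps = 7
for p_step in (0.95, 0.99):
    p_pipeline = p_step ** steps
    print(f"{steps} steps at {p_step:.0%} each -> {p_pipeline:.1%} end to end")
```

Seven steps at 95% each succeed only about 70% of the time end to end; even at 99% per step, the seven-step chain still fails roughly 7% of the time. That’s why every nine is earned per step, not per pipeline.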

The blackjack pipeline in Octobatch, an open source batch orchestrator for multistep LLM workflows that I introduced in “The Accidental Orchestrator.”

That’s a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions, and how I learned the hard way that the March of Nines wasn’t just a theoretical problem but something I could watch happening in real time across hundreds of data points.

Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: Make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a 5, and an 8 add up to 23 doesn’t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That’s arithmetic and a lookup table, work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
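A strategy-compliance check like that can be a few lines of code. The fragment below is a toy slice of the basic-strategy chart (a handful of hard totals; the table and function names are mine, not Octobatch’s), but the shape is the point: a dictionary lookup instead of a model call:

```python
# A sliver of the basic-strategy chart for illustration only; a real
# table covers every hard total, soft hands, and pairs.
BASIC_STRATEGY = {
    # (player hard total, dealer upcard): correct first action
    (15, 10): "hit",
    (15, 6): "stand",
    (16, 10): "hit",
    (11, 6): "double",
}

def check_first_action(player_total, dealer_upcard, action):
    """Deterministic strategy-compliance check: no model, no judgment."""
    expected = BASIC_STRATEGY[(player_total, dealer_upcard)]
    return action == expected

print(check_first_action(15, 10, "stand"))  # False: the table says hit
```

The lookup returns the same answer on every run, instantly, at zero API cost.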

Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “AI, MCP, and the Hidden Costs of Data Hoarding.” Teams dump everything into the AI’s context because the AI can handle it, until it can’t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But “mostly works” is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you, which is exactly what Prompt 9 demonstrated.

Getting cascading failures out of the blackjack pipeline

I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That’s why I’m writing this article: The iteration arc turned out to be one of the clearest illustrations I’ve found of how the principle works in practice.

I addressed failures two ways, and the distinction matters.

Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn’t require an API call, so it’s free, instant, and 100% reproducible. There’s a math verification step that uses code to recalculate totals from the actual cards dealt and compares them against what the LLM reported, and a strategy compliance step checks the player’s first action against a deterministic lookup table. Neither of those steps requires any AI to make a judgment call; when I initially ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.
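The math verification step boils down to recomputing the total from the cards actually dealt and comparing it against the LLM’s reported number. A minimal sketch of that idea (the card encoding and function names are assumptions for illustration, not Octobatch’s actual step definition):

```python
# Recompute a blackjack hand total from the dealt cards and compare it
# to the total the LLM reported. Pure arithmetic: no model call needed.
VALUES = {"A": 11, "K": 10, "Q": 10, "J": 10, "10": 10, "9": 9, "8": 8,
          "7": 7, "6": 6, "5": 5, "4": 4, "3": 3, "2": 2}

def hand_total(cards):
    """Best blackjack total: aces start at 11, demoted to 1 to avoid busting."""
    total = sum(VALUES[c] for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

def verify_reported_total(cards, reported):
    """Flag the hand if the LLM's reported total doesn't match the cards."""
    return hand_total(cards) == reported

print(hand_total(["J", "5", "8"]))                 # 23
print(verify_reported_total(["A", "7", "9"], 17))  # True: ace demoted to 1
```

A check like this catches the miscounted-total failures at the exact step where they happen, instead of letting them cascade into every decision downstream.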

Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer’s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don’t eliminate the LLM from the step; they make the LLM more reliable within it.

But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when you rely on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, several runs were clearly wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce them didn’t have sufficient guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn’t add up. Once you let probabilistic behavior into a step that needs to be deterministic, the output will look plausible and the system will report success, but you have no way to know something’s wrong until you go looking for it.

Once I fixed the bug, the real pass rate emerged: 31%. Here’s how the next seven iterations played out:

  • Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.
  • Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it’s also more expensive because it uses more tokens and takes more time.
  • Replacing the LLM validator with deterministic code (48% → 79%). This was the single largest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I’d given it. But there’s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
  • Rigid output format (79% → 81%). The LLM kept skipping the dealer’s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
  • Overriding the model’s priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn’t help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
  • Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.

Find the best ways to earn your nines

If you’re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one recognition was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.

I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc, 48% to 79%, came from replacing an LLM validator with a 10-line expression.

Here’s the bottom line for me: If you can write a short function that does the job, don’t give it to the LLM. I originally reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I realized it wasn’t one at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.

At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.

The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans, and it changes the way you think about what a user manual is for.
