Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. These things matter, but they’re only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.
When we design agent systems, we still need to remember that costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.
We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.
Old problems wearing new clothes
When we use coding agents to generate agent architectures, it’s tempting to stop at “the trace looks reasonable.” The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement these patterns unless you ask for these optimization layers explicitly.
Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.
It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. However, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to know what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.
The cost multiplier, repeated-work problems, and backtracking
The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.
This is one of the uncomfortable differences between traditional software and agent systems. In a traditional application, a failed operation often throws an error, times out, or leaves a trace that’s easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need five attempts (the first try plus four retries):
$$1 - (1 - 0.60)^{5} = 0.98976$$
This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first “thought” is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.
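To make the arithmetic concrete, here is a minimal Python sketch. The first two functions reproduce the independent-trials calculation. The third is a toy model of correlated retries and is purely an assumption for illustration: it supposes that some fraction of requests (`flaw_rate`, a hypothetical parameter) start from a flawed prompt or state in which every attempt fails.

```python
import math


def success_after(p: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of independent attempts that reaches `target` (0 < p < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


def success_with_shared_flaw(p: float, attempts: int, flaw_rate: float) -> float:
    """Toy correlated model (an assumption, not a measurement).

    With probability `flaw_rate` the shared prompt or state is flawed and
    every attempt fails. Otherwise attempts are independent, with a
    per-attempt success rate chosen so that a single attempt still succeeds
    with probability `p` overall.
    """
    if not (0.0 < p < 1.0 and 0.0 <= flaw_rate <= 1.0 - p):
        raise ValueError("need 0 < p < 1 and 0 <= flaw_rate <= 1 - p")
    p_when_sound = p / (1.0 - flaw_rate)
    return (1.0 - flaw_rate) * success_after(p_when_sound, attempts)
```

Under independence, `success_after(0.6, 5)` gives about 0.98976, and `attempts_needed(0.6, 0.99)` is 6. In the toy correlated model with `flaw_rate=0.2`, five attempts reach only about 0.799, and no number of retries can exceed 0.8, because retries cannot repair the shared state.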
This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.
Agent workflows need the same discipline. A retry shouldn’t mean “run it again and hope.” It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.
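One way to sketch this discipline in Python is a retry loop that carries failure reasons forward and stops when the same failure repeats. Everything here is hypothetical: `attempt` stands in for an LLM call that receives the accumulated state, `validate` returns a failure reason or `None`, and the "same failure twice means stop" rule is one possible policy, not the only one.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RetryState:
    """State carried between attempts: the task and why earlier tries failed."""
    task: str
    failures: List[str] = field(default_factory=list)


def retry_with_feedback(
    task: str,
    attempt: Callable[[RetryState], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    state = RetryState(task)
    for _ in range(max_attempts):
        output = attempt(state)      # the attempt sees every earlier failure
        reason = validate(output)    # None means the output passed validation
        if reason is None:
            return output
        if reason in state.failures:
            # The same failure came back: the state did not change in a way
            # that matters, so stop paying for this branch.
            return None
        state.failures.append(reason)
    return None


def json_validator(output: str) -> Optional[str]:
    """Example validator: the output must parse as JSON."""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError:
        return "not valid JSON"


def fake_attempt(state: RetryState) -> str:
    """Stand-in for a model call that adapts once it is told what failed."""
    if "not valid JSON" in state.failures:
        return '{"total": 42}'
    return "total is 42"
```

With these stand-ins, `retry_with_feedback("sum the invoice", fake_attempt, json_validator)` succeeds on the second attempt because the second call runs under a changed state. An attempt function that ignores the feedback is cut off after two calls instead of consuming the full retry budget.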
Memoization, pruning, and dynamic programming
Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.
In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That’s what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:
$$a = f(s)$$
where $s$ is the current state of the workflow and $a$ is the next action. Without memoization, this function is evaluated again and again by an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in AI Agents: The Definitive Guide.
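A minimal sketch of a memoized planner in Python might look like the following. The names `state_key` and `MemoizedPlanner` are hypothetical, and the notion of "equivalent state" used here (same task after whitespace and case normalization, same facts and tools regardless of order) is an assumption you would need to adapt to your own workflow.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def state_key(task: str, facts: Iterable[str], tools: Iterable[str]) -> str:
    """Canonicalize the state so that equivalent states hash identically."""
    canonical = json.dumps(
        {
            "task": " ".join(task.lower().split()),
            "facts": sorted(facts),
            "tools": sorted(tools),
        },
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MemoizedPlanner:
    """Wraps an expensive planner call (for example, an LLM) with a lookup."""

    def __init__(self, plan: Callable[[str, List[str], List[str]], str]) -> None:
        self._plan = plan
        self._cache: Dict[str, str] = {}
        self.planner_calls = 0  # how often the expensive planner actually ran

    def next_action(self, task: str, facts: Iterable[str], tools: Iterable[str]) -> str:
        facts, tools = list(facts), list(tools)
        key = state_key(task, facts, tools)
        if key not in self._cache:
            self.planner_calls += 1
            self._cache[key] = self._plan(task, facts, tools)
        return self._cache[key]
```

The key has to include everything that can change the right answer. If tool versions, policies, or retrieved evidence affect the decision, they belong in the key; otherwise the cache returns stale actions.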
But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. However, don’t limit pruning to KV cache pruning or speculative decoding. Use it also when a tool repeatedly returns no new information. Your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.
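The "tool returns no new information" case can be sketched as a loop that tracks what has already been seen and abandons the branch once results go stale. This is an illustrative sketch only: `tool` is a hypothetical callable that returns a set of result identifiers, and `patience` (how many stale calls to tolerate) is an assumed tuning knob.

```python
from typing import Callable, Iterable, Set, Tuple


def run_until_no_new_info(
    queries: Iterable[str],
    tool: Callable[[str], Set[str]],
    patience: int = 1,
) -> Tuple[Set[str], int]:
    """Issue queries on one branch until `patience` calls in a row add nothing.

    Returns the distinct results collected and the number of tool calls made.
    """
    seen: Set[str] = set()
    stale_calls = 0
    calls = 0
    for query in queries:
        calls += 1
        new_results = tool(query) - seen
        if new_results:
            seen |= new_results
            stale_calls = 0
        else:
            stale_calls += 1
            if stale_calls >= patience:
                break  # prune: rewording the query is not producing information
    return seen, calls
```

The same shape applies to a reflection loop: replace the set of result identifiers with the set of correctness issues a critique raises, and stop when a round raises none that are new.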
Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across multiple documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for multiple report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.
Table 1. Classical optimization patterns applied to AI agent systems
| Optimization | The “old” CS way | The “agent” way |
| --- | --- | --- |
| Memoization | Store results of expensive function calls. | Cache decisions. If the agent saw this state before, don’t ask it to reason again. |
| Pruning | Cut off search paths in a tree that won’t lead to a solution. | Kill a reflection loop when the critique stops yielding structural improvements. |
| Dynamic programming | Break problems into overlapping subproblems. | Share codebase analysis across multiple specialized agents instead of rereading files. |
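The dependency-chain case can be sketched with a top-down dynamic programming approach in Python. The dependency graph below is invented for illustration, the graph is assumed to be acyclic, and the `ANALYSIS_CALLS` list merely stands in for an expensive step such as reading files or calling a model.

```python
from functools import lru_cache
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical dependency graph: two entry points share the "db" subtree.
DEPENDENCIES: Dict[str, Tuple[str, ...]] = {
    "api": ("auth", "db"),
    "worker": ("db", "queue"),
    "auth": ("crypto",),
    "db": (),
    "queue": (),
    "crypto": (),
}

# Records every time the expensive analysis actually runs.
ANALYSIS_CALLS: List[str] = []


@lru_cache(maxsize=None)
def analyze(module: str) -> FrozenSet[str]:
    """Return the transitive dependencies of `module`.

    Each module is analyzed once; later branches reuse the stored result.
    """
    ANALYSIS_CALLS.append(module)  # stand-in for reading files or an LLM call
    found = set(DEPENDENCIES[module])
    for dependency in DEPENDENCIES[module]:
        found |= analyze(dependency)
    return frozenset(found)
```

Analyzing `"api"` and then `"worker"` touches six modules in total, and `"db"` is analyzed only once even though both entry points depend on it. In a multi-agent setup, the equivalent is a shared store of subproblem results that every specialized agent consults before doing the work itself.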
This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.
Where to start: Optimization follows topology
The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.
Centralized
A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.
Decentralized
Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.
Independent/swarms
Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.
Hybrid
The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, and memoization on the coordinator’s decisions across them.
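For the swarm case, a rough Python sketch of deduplicating fan-out questions could look like this. The `normalize` function is a deliberately crude stand-in for a real semantic-equivalence check (it only ignores case, punctuation, and spacing), and `ask` is a hypothetical callable representing the paid model or tool call.

```python
import re
from typing import Callable, Dict, List, Tuple


def normalize(question: str) -> str:
    """Crude equivalence key: ignore case, punctuation, and extra spacing.

    Assumption: a production system would likely use a stronger notion of
    semantic equivalence than this.
    """
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def answer_fanout(
    questions: List[str],
    ask: Callable[[str], str],
) -> Tuple[List[str], int]:
    """Answer every question, paying for each distinct question only once.

    Returns the answers in input order and the number of paid calls.
    """
    cache: Dict[str, str] = {}
    paid_calls = 0
    answers: List[str] = []
    for question in questions:
        key = normalize(question)
        if key not in cache:
            paid_calls += 1
            cache[key] = ask(question)
        answers.append(cache[key])
    return answers, paid_calls
```

If three agents in the swarm submit the same question with different casing or punctuation, the sketch pays for one answer and serves the other two from the cache.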
The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.
