Friday, June 26, 2026

Agentic Code Review – O’Reilly


The following article originally appeared on Addy Osmani’s blog and is being republished here with the author’s permission.

Coding agents are extraordinarily good now, and getting better fast. The interesting consequence is that the hard part of engineering moved from writing code to deciding whether to trust it, which makes review the most leveraged skill in software right now. How you approach it depends enormously on who you are: A solo developer with no users and a team maintaining a 10-year-old application are not solving the same problem.

I’m more optimistic about agentic engineering than I’ve ever been. The agents are genuinely good, they get better every month, and on an ordinary day I now ship things I would not have attempted a year ago. This write-up is a map of where the interesting work went, because it did move, and most teams haven’t fully caught up to where.

Code review used to work because of a happy accident of relative speed. A senior engineer could read code faster than a junior could write it, so review kept pace without anyone designing it to, and the team absorbed how the system fit together as a side effect of reading each other’s diffs. A lot of that was not deliberate. It fell out of a single fact: Writing code was the slow, expensive part, and reading it was cheap and fast.

That fact no longer holds. An agent will produce a thousand lines of generally solid, well-formatted code in less time than it takes me to read this paragraph, while a human’s reading speed has not changed since roughly the day we started staring at screens for a living. So the constraint moved downstream, to the one step that didn’t get faster: a person being confident the change is right. I don’t think that’s a loss. It’s the most leveraged place in software to be good right now, and it’s where I’ve put most of my attention this year.

There’s a happy twist here that shapes the rest of this piece. The same tools producing all that extra code are also the best thing I have for keeping up with it. On my own projects, including the popular open source ones, I now point Claude Code or Codex at a batch of incoming PRs and have them triage the queue for me, and that has genuinely changed how I spend my time. So this isn’t an anti-AI argument, and I’ll come back to exactly how I use AI.

It’s also not a data dump, and not another round of whether letting a model write your code is wonderful or the end of the craft, because that framing is useless. The only answer that survives contact with a real codebase is that it depends entirely on who you are. A developer vibe-coding a side project only a dozen people will ever run and a team keeping a 10-year-old enterprise system alive for another quarter share almost no constraints worth naming, and much of the advice in circulation is really one of those two people telling the other how to live.

What the 2026 data actually shows

The productivity gains from AI are real, but raw output overstates them: about four times the code for a tenth more delivered value. The gap between those numbers is review work, which is exactly why review is where the leverage now sits.

For a couple of years this was an anecdotal argument. It’s now measured at scale, by organizations with no shared agenda and in several cases competing commercial interests, and the measurements keep pointing the same way: AI pushes output sharply up and pushes both quality and reviewability down.

Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as teams moved from low to high AI adoption. This is March 2026 data, about as current as anything here. The upside is real. Developers merge considerably more PRs and complete more work, and throughput per engineer climbs. Then the rest of the report:

  • Code churn is up 861%.
  • The incidents-to-PR ratio is up 242.7%.
  • The per-developer defect rate is up from 9% to 54%.
  • Median review duration is up 441.5%, with time to first review and average review time both roughly doubling.
  • PRs merged with zero review are up 31.3%.

The last figure is the one I find hardest to dismiss, because nobody chose to stop reviewing. Reviewers simply couldn’t keep pace with the volume, so code began merging unread, and that became normal. The detail I keep returning to is that teams with mature, disciplined engineering practices were hit just as hard as everyone else. Good process didn’t protect them, because the volume arrived faster than any process was designed to absorb.

CodeRabbit studied 470 open source PRs in December 2025, 320 AI-coauthored and 150 human-only, and found the AI changes carried roughly 1.7x more issues. Logic and correctness problems were up about 75%, security issues were 1.5 to 2x more frequent, and readability problems more than tripled. The company’s AI director, David Loker, described these as “predictable, measurable weaknesses that organizations should actively mitigate.” Predictable is the operative word. These are known, locatable weaknesses, which is good news: It means a review process, human or automated, can be aimed directly at them.

One caveat to hold throughout: CodeRabbit and Faros both sell into this market, so their framing isn’t disinterested. That doesn’t make the numbers wrong (the effect sizes are large and consistent across unrelated sources), but vendor research deserves to be read with that in mind.

GitClear has the single number I’d lead with. In its productivity data through 2025, daily AI users produce around 4x the raw output of nonusers, but measured against their own output a year earlier, the real productivity gain is only about 12%. You’re producing roughly four times the code for something like a tenth more delivered value, and a human still has to review all of it. To GitClear’s credit, CEO Bill Harding is explicit that some of even that 12% is selection bias, because stronger developers are concentrated in the AI cohort.
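To make the gap concrete, here is a back-of-the-envelope calculation using the two GitClear figures quoted above. It assumes, purely for illustration, that review effort grows in proportion to the amount of code produced; that linearity is an assumption of this sketch, not something GitClear reports.

```python
# Figures quoted from GitClear in the paragraph above.
raw_output_multiplier = 4.0        # daily AI users: ~4x raw output
delivered_value_multiplier = 1.12  # ~12% real productivity gain

# How much delivered value each unit of code now carries, relative to before.
value_per_unit_of_code = delivered_value_multiplier / raw_output_multiplier

# ASSUMPTION: review effort scales linearly with code volume.
# Under that assumption, this is the review load per unit of delivered value.
review_load_per_unit_of_value = raw_output_multiplier / delivered_value_multiplier

print(f"value per unit of code: {value_per_unit_of_code:.2f}x")
print(f"review load per unit of value: {review_load_per_unit_of_value:.2f}x")
```

Under those assumptions each line of code carries roughly 0.28x the value it used to, and each unit of delivered value arrives with about 3.57x the reading attached.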

GitHub reports that Copilot review has now run over 60 million reviews, a 10x increase in under a year, and more than one in five reviews on the platform involves an agent. This is no longer a niche practice. It’s how code gets made.

Four datasets, four methods, one conclusion. We poured machine-speed output into a system built for human-speed work. The bottleneck didn’t disappear; it moved to verification, and review is where that bill comes due.

Everyone is solving a different problem

How much review a change needs depends almost entirely on its blast radius, and most advice you read was written by someone operating at a very different one.

Almost all of the alarming data above comes from enterprise telemetry and from open source maintainers being overwhelmed. It’s entirely real if that’s your situation. If you’re one person shipping something only a handful of people will ever run, much of it simply doesn’t apply to you, and you shouldn’t be made to feel otherwise.

Three variables determine where you sit:

  • Blast radius: What happens when it breaks? Nothing, or angry users and money and PII on the line?
  • How long the code lives: A throwaway prototype you might rewrite next week, or a codebase you’ll maintain for years?
  • How many people need to understand it: Just you holding the whole thing in your head, or a team that has to share ownership over time?

Run the same diff through these three variables, and “good review” means genuinely different things.

If you’re working solo on a greenfield project with no users, review’s second job, distributing knowledge across a team, doesn’t exist for you. You are the team. The reasonable move is to lean hard on tests and automation, review the parts that genuinely matter, and accept a lighter touch on the rest. Duplication and churn cost far less when the code may not exist in a month and nobody is paged at 3:00am when it breaks. The catch, and people learn this one painfully, is that it only works if the tests are real. Skipping review without a safety net doesn’t remove the work. It defers it at a higher price, and standards slip when no one is there to push back. “No users” is permission to defer review. It isn’t permission to skip verification.

Then the project gets users. This is the dangerous middle, and the crossing isn’t visible at the time. Review’s bug-catching role suddenly matters, because bugs now hurt people, and its knowledge-sharing role switches on, because it’s no longer only you. Teams keep their solo-era habits a few months too long, and then there’s a postmortem and the Faros numbers stop being a chart and become their own dashboard.

At the far end is the large organization with an old codebase and many users. Here every alarming figure lands at full force. A duplicated helper isn’t a style nit; it’s a future bug surface and a maintenance cost that compounds for years. A change nobody understood is comprehension debt that becomes someone’s on-call incident. Review is doing several jobs at once, and the volume of agent output quietly breaks all of them. The Faros finding about mature teams is aimed squarely here.

So the point isn’t “Enterprises should be careful and solo developers can relax.” It’s that the purpose of review changes with your position, so the rules have to change with it. Bolt an enterprise’s locked-down, multi-agent, evidence-required pipeline onto a two-person prototype and you’ve added friction for no benefit. Run “tests pass, ship it” on a payments system and you’ve built an incident generator with a green checkmark on top. Most bad advice in this space is one position on that spectrum prescribing to another.

What review is actually for now

Review was built to check an author’s reasoning. An agent does reason, but that reasoning is usually thrown away rather than attached to the code, so the reviewer has to reconstruct a rationale that never made it into the diff. The good news is that this is a tooling problem, and capturing the reasoning makes review dramatically easier.

This is the part that genuinely changed, and I think it’s underappreciated.

When a human writes code, intent comes along for free. The reasoning, the alternatives weighed and discarded, lived in the author’s head, and review was you checking that reasoning. Modern agents do reason, often visibly, producing thinking traces and weighing options and explaining themselves as they go. The catch is that this reasoning is usually discarded the moment the diff is produced. It’s rarely captured and seldom attached to the PR, and in any case it’s the agent’s reasoning about how to implement the task, not a human’s judgment about whether it was the right task to begin with. So review shifts from checking reasoning that sits in front of you to reconstructing intent that never got written down, which is harder and slower, and we keep acting surprised that it takes 441% longer.

A 2026 paper, “AI Slop and the Software Commons,” analyzed 1,154 posts across 15 Reddit and Hacker News threads where developers discussed “AI slop.” One line from a developer has stayed with me: reviewing an agent’s PR made them “the first human being to ever lay eyes on this code.”

That sentiment points directly at the fix. In normal review, the author already understood the change and you were checking their work. With an agent PR, nobody has reconstructed the why yet, and the reviewer is the first to try. As the paper puts it, review “wasn’t built to recover missing intent.” The encouraging part is that missing intent is recoverable: The reasoning existed; we just discarded it. Have the agent state what it was trying to do and what it ruled out, then capture it as a decision log on the PR, and a large part of the reconstruction cost disappears. This is a tooling problem, and tooling problems get solved.
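One way to capture that reasoning is a small structured record that the agent fills in and the pipeline posts to the PR description. The sketch below is a minimal illustration only: the `DecisionLog` and `RuledOut` names and their fields are hypothetical, not part of any existing tool, and the example content is invented.

```python
from dataclasses import dataclass, field


@dataclass
class RuledOut:
    option: str  # an alternative the agent considered
    reason: str  # why it was rejected


@dataclass
class DecisionLog:
    goal: str      # what the change is trying to achieve
    approach: str  # what the agent actually did
    ruled_out: list[RuledOut] = field(default_factory=list)

    def render(self) -> str:
        """Format the log as plain text for a PR description."""
        lines = [f"Goal: {self.goal}", f"Approach: {self.approach}", "Ruled out:"]
        if not self.ruled_out:
            lines.append("  (none recorded)")
        for item in self.ruled_out:
            lines.append(f"  - {item.option}: {item.reason}")
        return "\n".join(lines)


# Invented example content, for illustration only.
log = DecisionLog(
    goal="Stop duplicate charges when a request is retried",
    approach="Add an idempotency key to charge requests",
    ruled_out=[RuledOut("Client-side retry lock", "does not survive a server restart")],
)
print(log.render())
```

The format matters less than the habit: the reviewer starts from a stated goal and a list of rejected alternatives instead of from a bare diff.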

None of which makes “have the AI review the AI” a complete answer on its own. A second model with different priors genuinely catches real bugs, and it catches a lot of them, which is why you should run one. What it doesn’t supply is the human judgment about whether this is the right change to build in the first place. That judgment stays with a person, and it happens to be the most interesting part of the job and the part worth keeping.

The tools are good, but not always for the reason they advertise

The current AI reviewers are genuinely good, and they mostly don’t flag the same lines as each other, so the right move isn’t picking the best one but running two that are built differently.

The dedicated AI review tools are good now, and I think you should be running at least one on everything, side projects included. CodeRabbit is the most widely deployed and topped the independent Martian benchmark (January to February 2026) on F1, at around 49% precision with the best recall in the field. Greptile trades precision for recall, with around an 82% bug-catch rate against CodeRabbit’s 44% in one benchmark, at the cost of more false positives. Anthropic’s Code Review reports under 1% of its findings marked incorrect by their engineers; the figure I’d actually show a manager is that it raised their internal rate of PRs receiving a substantive review from 16% to 54%. The long tail of changes that used to get a glance and an approval now gets read by something.

The most useful result I’ve seen this year isn’t from a vendor. An engineer ran four reviewers in parallel, CodeRabbit, Sentry Seer, Greptile, and Cursor BugBot, across 146 real PRs and 679 findings over three and a half weeks:

Of 617 distinct flagged locations, 93.4% were caught by exactly one of the four tools. 6% by two. Almost none by three. None at all by all four.

Not once did all four tools flag the same line. Each was strong at a different category of problem: Greptile with near-zero false positives on correctness and architecture, CodeRabbit with the widest net and one-click fixes, and Seer best on production-failure severity. That’s the adversarial review argument demonstrated on a real codebase rather than in a paper. Heterogeneity is the whole point. Four copies of one model is a single reviewer with a larger invoice, while four genuinely different reviewers surface a set of bugs no single member could find alone, the human included.
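Measuring this kind of overlap on your own repository takes very little code. The sketch below assumes each finding has been exported as a (tool, file, line) tuple, which is a hypothetical format; the sample findings are invented for illustration and are not the experiment’s data.

```python
from collections import Counter, defaultdict


def overlap_histogram(findings):
    """Map 'number of tools that flagged a location' -> 'count of locations'."""
    tools_by_location = defaultdict(set)
    for tool, path, line in findings:
        tools_by_location[(path, line)].add(tool)
    return Counter(len(tools) for tools in tools_by_location.values())


# Invented sample: four locations, one of them flagged by two tools.
sample_findings = [
    ("tool_a", "billing/charge.py", 41),
    ("tool_b", "billing/charge.py", 41),
    ("tool_a", "api/handlers.py", 10),
    ("tool_c", "api/handlers.py", 88),
    ("tool_d", "db/migrate.py", 7),
]

histogram = overlap_histogram(sample_findings)
for tool_count in sorted(histogram):
    print(f"flagged by {tool_count} tool(s): {histogram[tool_count]} location(s)")
```

If most of your locations land in the “flagged by 1 tool” bucket, the reviewers are complementary; if they pile up at higher counts, you are paying twice for the same reviewer.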

In practice: Don’t agonize over the single best tool because there isn’t one. At the high-stakes end, run two with deliberately different characters. (The experiment above paired Greptile for everyday correctness with Seer for production-failure severity, with almost no overlap.) If you are solo, one good reviewer plus real tests is plenty. And whatever the marketing says, measure it on your own code, because every one of these results was specific to a particular codebase, and yours will be too.

Should we just let AI review more of it?

The machine is already reviewing more of your code than you are. The only real decision left is whether you do that deliberately, and the amount of human you keep should scale with your blast radius.

I keep hearing a question from experienced engineers that would have been heresy a year ago: Should the machine be doing more of the reviewing, perhaps most of it? I no longer think that’s a stupid question.

The uncomfortable part is that AI review works. Under 1% of Anthropic’s findings are marked wrong; the tools catch bugs humans read straight past, and they don’t get tired on the thirtieth PR of the day, which is exactly when a human is least reliable. Meanwhile humans are visibly not keeping up: Zero-review merges are up 31% and review times are up triple digits. In a real sense the machine is already reviewing more of the code than we are. The honest framing isn’t “Should we let AI review more?” but “AI is already doing it, so are we going to be deliberate about that or let it happen by default while pretending humans still read everything?”

Loop engineering sharpens this. The premise of a loop is that you stop being the person who prompts the agent and instead build a system that prompts it, and a central part of that system is a judge: an agent that decides whether the work is done before moving on. The reviewer is the next role being designed out of the inner loop, on purpose. We spent a year automating the writing, and the loops are now automating the checking, and the human keeps getting pushed up and out. “Where does the human stay?” isn’t a seminar question; it’s something you decide every time you wire up a loop, whether or not you realize you’re deciding it.

Where I currently land, and I hold this loosely: The answer isn’t “a human reads every line.” That’s over. The volume ended it, and anyone insisting otherwise is describing a world that no longer exists. But it’s also not “let the loop review itself and walk away.” When an agent writes the code, another reviews it, and a third judges it, you have a closed loop of models with broadly correlated blind spots, especially when they come from the same family, confidently agreeing in the same places. A confident “looks good” with no human anywhere in it is borrowed confidence: The system’s certainty becomes yours, and nobody actually understood anything. The loop can be both very sure and very wrong, with no human left to tell the difference.

So the human doesn’t leave; the human moves up a level. You stop reviewing every diff and start owning the parts that don’t transfer to a model. Accountability, because you can’t page a model at 3:00am. The judgment of whether this is even the right change to build, as distinct from whether the code is correct. The high-blast-radius gates where being wrong is expensive. And the awkward one: the behavior nobody specified, because a model reviews the code that exists and rarely flags the requirement that nobody thought to write down, which remains a human-shaped gap I don’t expect to close soon. Human in the loop becomes human on the loop: sampling, spot-checking, and auditing the system rather than reading every PR, and spending your limited attention where being wrong would actually hurt.

This is already how I work on my own projects, including the open source ones that now see more PRs in a day than I could carefully read in an evening. I point Claude Code or Codex at a batch of incoming PRs and ask for a first pass: a high-level read of what looks safe to merge, what needs more work, and what’s genuinely high-risk. I don’t auto-merge on the result, and I don’t lazy-merge whatever it approves. What it gives me is a way to allocate attention. I can spend a few minutes confirming the changes it considers low risk, and put real, careful time into the ones it flags as dangerous. The detail that matters is that this isn’t my old review hour made slightly faster. It’s a different kind of hour, and at the volume I now deal with, it’s the main reason the queue stays survivable at all.

Codex and Claude Code giving me a first-pass, risk-sorted read of a batch of PRs. The triage is the help. The merge decision stays mine.

A more extreme version of the same move is Kun Chen, an ex-Meta L8 engineer now shipping around 40 PRs a day as a solo builder, who has largely stopped reviewing code. It would be easy to dismiss this, except he’s an L8, unusually good at the thing he stopped doing. He runs 20 to 30 agents in parallel and has moved his effort into the plan: He writes detailed plans up front; the agents run for hours against them, and he says plan quality determines how long they can run unattended. That’s the move I described above in its purest form. It’s worth being precise about what actually happened, because it isn’t that he stopped verifying. The intent didn’t vanish; he wrote it down himself in the plan, so the “first human to ever lay eyes on this” problem is half-solved. A human did understand the why, just up front rather than after. And he didn’t work without a net. He built an automated review gate (which he calls No Errors) that checks the code before it merges, and he stays on escalation when an agent gets stuck. The human does the expensive thinking before the code exists, and the machine does the line-by-line afterward, which is probably the shape of where this goes.

But he’s a solo builder with no large team and no decade-old system full of landmines beneath him. The exact conditions that make 40 PRs a day without review rational for him are conditions most readers don’t have. Copy his workflow onto a team shipping to many users and you reproduce the Faros numbers on your own dashboard. Kun isn’t wrong; he’s just a long way down one particular end of the spectrum.

Which is the spectrum point again. Solo with no users: Letting AI review almost all of it is a defensible 2026 position, and you shouldn’t feel guilty about it. Maintaining something large for many people: Let the machine handle the first pass, the second pass, and the boring 90%, but keep a real human on the load-bearing paths and don’t let the loop close completely on anything that can hurt someone. How much human you keep is a dial, and you set it by blast radius, not by guilt.

What to actually do

Stop reviewing everything to the same depth. Spend scarce human attention only where being wrong is costly, and let cheap deterministic gates and AI reviewers handle the rest.

The organizing idea is to match review effort to the cost of being wrong, push the cheap deterministic work as early as possible, and reserve human attention for what only humans can do.

Tier by risk, not by author. A config change earns a linter and a glance. A payments path earns the full stack: types, tests, two different AI reviewers, a human who owns that system, and a security pass. Don’t spend a heavy review on boilerplate, and don’t wave through an auth change because the tests are green. The layered approach is the same everywhere; what changes is how many layers a given diff has to clear.
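As a rough sketch of what tiering can look like in a pipeline, the snippet below maps a diff’s file paths to a risk tier and the tier to the layers it must clear. The low and high tiers follow the two examples in the paragraph above; the middle tier, the path prefixes, and the file suffixes are assumptions made up for this illustration.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1     # e.g. a config change
    MEDIUM = 2  # ASSUMPTION: a middle tier the article does not spell out
    HIGH = 3    # e.g. a payments or auth path


# Each layer applies at its minimum tier and every tier above it.
LAYERS = [
    (Risk.LOW, "linter"),
    (Risk.LOW, "human glance"),
    (Risk.MEDIUM, "types"),
    (Risk.MEDIUM, "tests"),
    (Risk.MEDIUM, "AI reviewer A"),
    (Risk.HIGH, "AI reviewer B (a different tool)"),
    (Risk.HIGH, "human owner of the system"),
    (Risk.HIGH, "security pass"),
]

HIGH_RISK_PREFIXES = ("payments/", "auth/")   # hypothetical repo layout
LOW_RISK_SUFFIXES = (".yaml", ".toml", ".md")  # hypothetical "config-like" files


def classify(paths):
    """Pick a tier from the files a diff touches; the riskiest file wins."""
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in paths):
        return Risk.HIGH
    if all(path.endswith(LOW_RISK_SUFFIXES) for path in paths):
        return Risk.LOW
    return Risk.MEDIUM


def required_layers(risk):
    return [name for minimum, name in LAYERS if risk >= minimum]


print(required_layers(classify(["config/app.yaml"])))
print(required_layers(classify(["payments/charge.py", "README.md"])))
```

Note that classification looks only at what the diff touches, never at who or what wrote it, which is the point of tiering by risk rather than by author.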

Fast-fail the expensive tail. The most useful recent finding for teams drowning in agent PRs is “Early-Stage Prediction of Review Effort” (January 2026), which studied 33,707 agent-authored PRs. Agents are good at small, well-defined changes. Around 28% merge almost instantly, but they tend to “ghost” the moment they get subjective feedback, abandoning the back-and-forth that review actually is. (A companion 2026 paper found reviewer abandonment accounted for 38% of rejected agent PRs.) The researchers built a “circuit breaker” that predicts high-maintenance PRs from cheap signals like file types and patch size before a human looks, and it works well. Triage agent PRs up front, fast-track the trivial ones, and don’t let a person sink an hour into a sprawling change the agent will abandon as soon as you push back.
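A crude version of this kind of up-front triage can be written with hand-set thresholds. To be clear, this is not the paper’s model, which is not described here in enough detail to reproduce; the thresholds, the labels, and the `PullRequest` shape below are all hypothetical.

```python
import os
from dataclasses import dataclass

# All thresholds are hypothetical; tune them against your own PR history.
DOC_EXTENSIONS = {".md", ".txt", ".rst"}
FAST_TRACK_LINES = 30
SEND_BACK_LINES = 800
SEND_BACK_FILES = 25


@dataclass
class PullRequest:
    files: list[str]
    lines_changed: int


def triage(pr: PullRequest) -> str:
    """Route a PR using only cheap signals: file types and patch size."""
    extensions = {os.path.splitext(path)[1] for path in pr.files}
    docs_only = bool(extensions) and extensions <= DOC_EXTENSIONS

    # Sprawling changes go back to the submitter before a human reads them.
    if pr.lines_changed > SEND_BACK_LINES or len(pr.files) > SEND_BACK_FILES:
        return "send-back"
    # Tiny or documentation-only changes skip the queue.
    if docs_only or pr.lines_changed <= FAST_TRACK_LINES:
        return "fast-track"
    return "human-review"


print(triage(PullRequest(files=["README.md"], lines_changed=120)))
print(triage(PullRequest(files=["src/a.py", "src/b.py"], lines_changed=300)))
print(triage(PullRequest(files=["src/a.py"], lines_changed=3500)))
```

Even a rule this blunt changes where the hour goes: the reviewer only opens the PRs in the middle bucket.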

Raise the bar for what you’ll even review. The fix for being buried isn’t locking down the repository. It’s refusing to review changes that arrive without evidence. Require, before review, a statement of what the change is for, a diff that isn’t 3,500 lines with no comments, the test output, and proof it was actually run. This is how you stop being the first human to read the code. You push the intent-reconstruction work back onto whoever submitted it, where it’s cheap, rather than absorbing it yourself, where it’s expensive.
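The intake bar can be enforced mechanically before anyone is assigned. The sketch below checks a submission for the four pieces of evidence listed above; the `Submission` fields and the size limit are hypothetical, and in practice the values would be pulled from your PR template and CI.

```python
from dataclasses import dataclass

MAX_DIFF_LINES = 500  # hypothetical team limit, not a recommendation


@dataclass
class Submission:
    purpose: str       # statement of what the change is for
    diff_lines: int    # size of the diff
    test_output: str   # pasted or linked test results
    run_evidence: str  # proof it was actually run, e.g. a log excerpt


def missing_evidence(sub: Submission) -> list[str]:
    """Return the reasons this submission is not ready for human review."""
    problems = []
    if not sub.purpose.strip():
        problems.append("no statement of what the change is for")
    if sub.diff_lines > MAX_DIFF_LINES:
        problems.append(f"diff is {sub.diff_lines} lines (limit {MAX_DIFF_LINES}); split it")
    if not sub.test_output.strip():
        problems.append("no test output attached")
    if not sub.run_evidence.strip():
        problems.append("no proof the change was actually run")
    return problems


bare = Submission(purpose="", diff_lines=3500, test_output="", run_evidence="")
for problem in missing_evidence(bare):
    print("rejected:", problem)
```

An empty list means the PR enters the review queue; anything else goes back to the submitter with the list attached.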

Keep PRs small, deliberately. Agent PRs run large, 51% larger on average in the Faros data, and reviewer engagement is one of the strongest predictors that a PR merges at all. A large unreviewable PR gets rejected outright or, worse, rubber-stamped. Instruct your agents to produce small commits. A diff a human can actually read is now a design constraint, not a courtesy.

Read the test changes more carefully than the code. This is the agent failure mode to watch. The agent changes behavior, then “fixes” the test by rewriting the assertion to match the new, broken behavior. A green check over 200 edited tests means nothing until you have confirmed the edits were correct. Treat any diff that rewrites many tests as a flag and read those first. Mutation testing earns its place here: Coverage tells you a line ran; mutation testing tells you whether the test would notice if that line were wrong.

Treat CI as the wall that doesn’t move. Watch for the patterns GitHub now warns reviewers about: removed tests, skipped lint, lowered coverage thresholds, a duplicated helper that already exists elsewhere, and untrusted input flowing into a prompt. That last one deserves emphasis, because agent-built features are a fresh source of prompt injection: If a change pipes user-controlled text into an LLM call without thinking about what that text can instruct the model to do, the vulnerability isn’t visible in the diff. It’s latent in the data that will arrive later. Agents will also weaken CI to make themselves pass, not maliciously, just gradient descent finding the cheapest path to green. Deterministic gates are the only part of the pipeline that can’t be talked out of their verdict by a confident paragraph, so keep them strict.
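Some of these patterns can be caught by a deterministic scan of the diff itself. The sketch below looks for three of them in unified-diff text: removed test functions, added skip markers, and a lowered coverage threshold. It assumes a Python project using pytest and coverage.py’s `fail_under` setting; the function name and sample diff are invented, and a renamed or moved test will also show up as “removed”, so treat the output as a prompt to look, not a verdict.

```python
import re

TEST_DEF = re.compile(r"\s*def (test_\w+)")
FAIL_UNDER = re.compile(r"\s*fail_under\s*=\s*(\d+)")


def ci_weakening_flags(diff_text):
    """Scan unified-diff text for changes that weaken the test or CI gates."""
    flags = []
    old_threshold = None
    for line in diff_text.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not content
        body = line[1:]
        if line.startswith("-"):
            test = TEST_DEF.match(body)
            if test:
                flags.append(f"removed test: {test.group(1)}")
            threshold = FAIL_UNDER.match(body)
            if threshold:
                old_threshold = int(threshold.group(1))
        elif line.startswith("+"):
            if "pytest.mark.skip" in body:
                flags.append("added skip marker")
            threshold = FAIL_UNDER.match(body)
            if threshold and old_threshold is not None:
                new_threshold = int(threshold.group(1))
                if new_threshold < old_threshold:
                    flags.append(
                        f"coverage threshold lowered: {old_threshold} -> {new_threshold}"
                    )
    return flags


# Invented sample diff, for illustration only.
SAMPLE_DIFF = """\
--- a/tests/test_refunds.py
+++ b/tests/test_refunds.py
-def test_rejects_negative_amount():
+@pytest.mark.skip(reason="flaky")
--- a/pyproject.toml
+++ b/pyproject.toml
-fail_under = 80
+fail_under = 60
"""

flags = ci_weakening_flags(SAMPLE_DIFF)
for flag in flags:
    print(flag)
```

Run as a required check, a scan like this cannot be argued with by the PR description, which is what makes it a gate rather than another opinion.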

A human owns the merge. A model can’t be paged and can’t be held accountable for what it shipped, so whoever clicks merge owns it. When an AI review says “looks good” in a calm, confident voice, it’s handing you confidence it hasn’t necessarily earned. Treat every AI review as a sensor, not a verdict: data, not a decision.

If you are solo with no users, the tiering, the test-change discipline, and CI are most of what you need; the rest is overhead until people show up. If you’re a large organization, all of it is the baseline, and the triage and intake bar are the difference between a review process that scales and one that quietly collapses.

What this means if you run a team

The bottleneck is no longer how fast you write code. It’s how fast a trusted human can be confident in a review. Cutting the people who provide that confidence because “AI made us faster” merely converts the saving into future incidents.

The binding constraint on shipping is now how fast a trusted human can be confident a change is correct. Any plan that treats generation as the bottleneck and review as free will quietly stall, with the velocity dashboard staying green the whole way.

The Faros report is direct about this: QA and review work rises even as output rises, so reducing engineering headcount because “AI made us faster” is dangerous unless you have closed the review gap first. The senior-engineer tax (review time up by triple digits) falls hardest on the people you can least afford to bottleneck, and it’s invisible to any metric that only counts merged PRs.

Open source maintainers hit this wall first and hardest. The steady stream of plausible but hollow contributions costs real triage time even when those contributions are well-intentioned, and that’s the canary. Companies are next. The ones handling it well treat review capacity as a real resource to be measured, protected, and spent deliberately, not as slack that AI has freed up.

Writing got cheap but understanding didn’t

Code review didn’t become less important when agents arrived. It became the central activity. Writing code is increasingly solved and getting cheaper by the month; the durable advantage is the system that lets you trust what was written.

Don’t take the one-size answer in either direction. If you’re solo with no users, the enterprise horror stories about churn and duplication are a future risk, not today’s fire, so lean on your tests, review what matters, and stay honest that the deferred work is still owed. If you maintain something large for many people, every alarming number here is about you, and the only thing that holds is a tiered, evidence-required, deliberately heterogeneous review process with a human owning the merge.

What’s constant across the whole spectrum is the underlying economics. We made writing cheap, and understanding stayed exactly as expensive as it has always been. The teams that do well over the next few years won’t be the ones producing the most code; they’ll be the ones who built a review system they can actually trust, and who never confuse “the tests passed” with “a person understands what this does and why.”

Or, as Simon Willison keeps putting it, “your job is to deliver code you have proven to work.” Agents haven’t changed that. They’ve made “proving” the center of the job rather than an afterthought, and I think that’s a good trade. Understanding a system well enough to stand behind it is the most durable and most interesting skill in software, and there has never been a better time to get extraordinarily good at it.
