This is the fifth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and part four here.
I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and recently I needed to get to the other side of the neighborhood. I thought I'd be clever: I like taking the bus, so I decided to hop on the one that goes right down 7th Avenue. I know I could check the schedule using the MTA's really useful Bus Time app or website, but it doesn't take into account walking time from my house or give me a good idea of when to leave. This seemed like a great opportunity to vibe code an app and do some quick AI-driven development.
It took about two minutes for Claude Code to get my new app working. It made a lovely little web UI, I configured my stop and how long it takes me to walk there, and it gave me the right departure time.
When I actually walked out the door, the app perfectly predicted my wait. There was just one problem: my bus was nowhere to be seen. What I did see was a bus driving the exact opposite direction down 7th Avenue.
It was pretty obvious what had happened. I needed to go deeper into Brooklyn, not toward Manhattan, and the AI had picked the wrong direction. (Actually, as Cowork pointed out, each stop has its own ID, and it had chosen the ID for the wrong stop.) I'd been using Cowork to orchestrate everything, and I could just as easily have asked it to go out and check the MTA's BusTime site for me to make sure the app was working. But I just trusted the AI. As a result, I had to walk. Which is fine (I love walking), but the irony was painful. I had literally just published an article about AI code quality and why you shouldn't blindly trust it, and here I was doing exactly that.
The app had a bug. But it wasn't the kind of bug you'd necessarily catch using a typical AI code review prompt. It built, ran, and did a perfectly fine job parsing the JSON from the MTA API. But if I'd started with a simple requirement, even just a user story like "as a Park Slope resident, I want to catch the B69 headed toward Kensington so I can get deeper into Brooklyn," the AI would have built it differently. The problem is that AI can only build the thing you tell it to build, which isn't necessarily the thing you wanted it to build. AI is really good at writing "correct" code that does the wrong thing.
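To make that concrete, here is the kind of test that user story would have produced. This is a minimal sketch: the stop IDs, the helper function, and the route data are all invented for illustration, not real MTA BusTime identifiers.

```python
# Hypothetical requirement-derived test for the bus app. The user story
# ("catch the B69 headed toward Kensington") pins down the direction,
# which my original prompt never did. All IDs below are invented.

KENSINGTON_BOUND_B69_STOPS = {"STOP_7AV_SOUTHBOUND"}  # hypothetical IDs
MANHATTAN_BOUND_B69_STOPS = {"STOP_7AV_NORTHBOUND"}   # hypothetical IDs


def configured_stop_id() -> str:
    """Stand-in for the app's stop configuration."""
    return "STOP_7AV_NORTHBOUND"  # the wrong-direction ID the AI picked


def test_configured_stop_heads_toward_kensington():
    # Derived from intent, not structure: no linter can check this.
    stop = configured_stop_id()
    assert stop in KENSINGTON_BOUND_B69_STOPS, (
        f"{stop} serves the Manhattan-bound direction; the requirement "
        "says the bus must head deeper into Brooklyn, toward Kensington."
    )
```

Run under pytest, this test fails against the app as built, which is exactly the point: the bug was only visible relative to the intent.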
My Brooklyn bus detour was a minor inconvenience. But it was a really useful, small-scale example of what I kept running into in my larger projects, too. There's an entire class of bugs that you can't find with structural analysis (no linter, no static analyzer, no AI code reviewer will catch them) because the code isn't wrong in any way that's visible from the code alone. You need to know what the code was supposed to do. You need to know the intent.
The data on why requirements matter goes back decades. Back in the 1990s, for example, the Standish CHAOS reports were a big eye-opener for me and a lot of other people in the industry, providing large-scale data that confirmed what we'd been seeing on our own projects: that the most expensive defects trace back to misunderstood or missing requirements. Those reports really underscored the idea that poor requirements management, and especially incomplete or frequently changing specifications, was among the primary drivers behind IT project failures. (And, as far as I can tell, they still are, and AI isn't helping matters; see my O'Reilly Radar article, "Prompt Engineering Is Requirements Engineering.")
The idea that requirements problems really are the source of the most expensive kind of defects should make intuitive sense: If you build the wrong thing, you have to tear it apart and rebuild it. That's why I made requirements the foundation of the Quality Playbook, an open-source skill for AI tools like Claude Code, Cursor, and Copilot that I introduced in the previous article. I've spent decades doing test-driven development, partnering with QA teams, and welcoming the harshest code reviews from teammates who don't pull punches, and that experience led me to build a tool that uses AI to bring back quality engineering practices the industry abandoned decades ago. I've tested it against a range of open-source projects in Go, Java, Rust, Python, and C#, from small utilities to widely used libraries with tens of thousands of stars, and it's found real bugs in almost every project it's come across, including ones that have been confirmed and merged upstream.
I think there are a lot of broader lessons we can learn from my experience using requirements to help AI find bugs, especially security bugs. So in this article, I want to focus on the single most important thing I've learned from building it: everything depends on requirements. Not just any requirements, but a specific kind of requirement that most projects don't have, that most AI tools don't ask for, and that turns out to be the key to making AI actually useful for verifying code quality.
Spec-driven development and what it misses
Developers using AI tools have been rediscovering the value of writing things down before asking the AI to build them. Spec-driven development (SDD) has become extremely popular, and for good reason. Addy Osmani wrote an excellent piece on this, "How to Write a Good Spec for AI Agents," and the core idea is sound: If you write a clear specification of what you want built, the AI produces dramatically better results than if you just describe it in a chat prompt and hope for the best.
I think SDD is important, and I'd encourage any developer working with AI to adopt it. But as I was building the Quality Playbook, I discovered that SDD has a blind spot that matters a lot for code quality. An SDD spec describes the how: what the implementation should look like. It tells the AI "implement a duplicate key check" or "add a retry mechanism with exponential backoff" or "create a REST endpoint that returns paginated results." That's useful for building things. But it's not enough for verifying them.
A requirement, on the other hand, doesn't say "implement a duplicate key check." It says "users depend on Gson to reject ambiguous input so they don't silently accept corrupted data." The AI can reason about the second in ways it can't reason about the first, because the second has the purpose attached. When the AI knows the purpose, it can evaluate whether the code actually fulfills that purpose across all the edge cases, not just the ones the spec explicitly listed. That's how the Quality Playbook caught a bug in Google's Gson library, one of the most widely used JSON libraries in Java.
I think it's worth digging into that particular bug, because it's a great example of just how powerful requirements analysis can be for finding defects. The playbook derived null-handling requirements from Gson's own community (GitHub issues #676, #913, #948, and #1558, some dating back to 2016), then used those requirements to find that duplicate keys were silently accepted when the first value was null. It confirmed the bug by generating a failing test, then patched the code and verified the test passed. I've used Gson for years and done a lot of work with Java serialization, so I read the code and the fix myself before submitting anything: trust but verify. The fix was merged as https://github.com/google/gson/pull/3006, confirmed by Google's own test suite.
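I can't paste Gson's internals here, but the shape of the bug fits in a few lines of Python. This is an analogy, not Gson's actual code: any duplicate-key check that uses "the previous value was non-null" as a proxy for "the key already existed" has a hole exactly where the first value is null.

```python
import json


def parse_rejecting_duplicates_buggy(text: str) -> dict:
    """Mirrors the *shape* of the Gson defect (not Gson's real code)."""
    def on_pairs(pairs):
        obj = {}
        for key, value in pairs:
            # BUG: None means both "key absent" and "value was null", so
            # a duplicate slips through when the first value is null.
            if obj.get(key) is not None:
                raise ValueError(f"duplicate key: {key}")
            obj[key] = value
        return obj
    return json.loads(text, object_pairs_hook=on_pairs)


def parse_rejecting_duplicates_fixed(text: str) -> dict:
    """Checks key *presence*, not value truthiness."""
    def on_pairs(pairs):
        obj = {}
        for key, value in pairs:
            if key in obj:  # presence test closes the null hole
                raise ValueError(f"duplicate key: {key}")
            obj[key] = value
        return obj
    return json.loads(text, object_pairs_hook=on_pairs)


doc = '{"a": null, "a": 1}'
print(parse_rejecting_duplicates_buggy(doc))  # {'a': 1}: silently accepted
parse_rejecting_duplicates_fixed(doc)         # raises ValueError, as required
```

Notice that the buggy version looks completely reasonable. The requirement, "duplicate keys must be rejected even when the first value is null," is what turns the last line into a failing test.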
That bug had been hiding in plain sight for years, through thousands of tests and countless code reviews. And it's possible that no structural analysis would ever have found it, because you needed the requirement to realize it was wrong.
This distinction might sound academic, but it has very concrete consequences for whether your AI can actually find bugs in your code.
About half of all security bugs are invisible to structural analysis
The security world has known about the limits of structural analysis for a long time. The NIST SATE evaluations found that the best static analysis tools plateaued at around 50-60% detection rates for security vulnerabilities. Gary McGraw's Software Security: Building Security In (Addison-Wesley, 2006) explains why: Roughly 50% of security defects are implementation bugs, and the other 50% are design flaws. Static analysis tools target the implementation bugs (buffer overflows, SQL injection, format string vulnerabilities) because those are pattern-matchable. But design flaws are about intent: The system's architecture doesn't enforce the security properties it's supposed to enforce, and no amount of scanning the code will reveal that. A 2024 study by Charoenwet et al. (ISSTA 2024) confirmed this is still the case: They tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went entirely undetected, and 76% of warnings in vulnerable functions were irrelevant to the actual vulnerability. The pattern is consistent across twenty years of research: There's a ceiling on what you can find by analyzing code, and it sits at around half.
There's a good reason for that limitation: the intent ceiling. A structural analysis tool is limited to reading the code and what it does; it has no way to take into account what the developer intended it to do.
When an AI does a code review without requirements, it's limited to structural analysis: pattern matching, code smell detection, race condition analysis. It can ask "does this look right?" but it can't ask "does this do what it's supposed to do?" because it doesn't know what the code is supposed to do. Structural review catches genuinely important stuff: race conditions, null pointer issues, resource leaks, concurrency bugs. A structural reviewer reading a shell script will catch a missing fi, a bad variable expansion, a race condition. Structural review is useful, and structural review is what most AI code review tools do today.
But about half of all security defects are intent violations: things the code doesn't do that it was supposed to do, or things it does that it wasn't supposed to do. They're invisible without a specification to check against, and no tool will find them in code that is, structurally, perfectly sound. A structural reviewer reading a script that's used, say, to validate router configuration files might find well-formed bash, correct syntax, proper quoting, and code that looks like it works and doesn't match known antipatterns. It wouldn't know the script is only validating three of the five access control rules it's supposed to enforce, because that's a requirements question, not a syntax question.
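Here's a contrived sketch of what that looks like, with the five rules and the validator both invented for illustration. Every structural check will pass this function; only the requirements list exposes the gap.

```python
# Contrived example (not a real tool): a config validator that is
# structurally spotless but only enforces three of the five access
# control rules the requirements call for.

REQUIRED_RULES = [
    "deny-by-default",     # 1. checked below
    "no-telnet",           # 2. checked below
    "ssh-v2-only",         # 3. checked below
    "mgmt-subnet-only",    # 4. never checked: invisible structurally
    "log-denied-traffic",  # 5. never checked: invisible structurally
]


def validate(config: str) -> list[str]:
    """Returns a list of violations. Clean code, incomplete intent."""
    violations = []
    if "default deny" not in config:
        violations.append("deny-by-default: missing default-deny policy")
    if "transport input telnet" in config:
        violations.append("no-telnet: telnet is enabled")
    if "ip ssh version 2" not in config:
        violations.append("ssh-v2-only: SSH v2 not enforced")
    # Rules 4 and 5 were never implemented. Nothing in this function is
    # syntactically or idiomatically wrong; only a check against the
    # requirements list reveals that two rules are missing.
    return violations
```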
Or, more personally for me, this is what happened with my bus tracker app: The JSON parsing was flawless, the UI was correct, the timing logic worked perfectly. The only problem was that it showed buses headed toward Manhattan when I needed to go deeper into Brooklyn, and no structural analysis would ever catch that, because you need to know which direction I intended to go. That's me and my very clever AI hitting the intent ceiling.
The intent ceiling is a security problem
This is where it gets really serious, because security vulnerabilities are some of the most dangerous members of this class of invisible bugs.
Think about what a missing authorization check looks like to an AI code reviewer. Let's say you've got a web endpoint with a well-formed HTTP handler, properly sanitized inputs, and a safe database query. The code is clean, and passes every structural check and static analysis tool you've thrown at it. Now you're testing it and, much to your dismay, you discover that the endpoint lets any authenticated user delete any other user's data, because nobody ever wrote down the requirement that says "only administrators can perform deletions." That's CWE-862: Missing Authorization, and it rose to #9 on the 2024 CWE Top 25 most dangerous software weaknesses.
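Here's that endpoint as a hypothetical Flask handler (the route, the schema, and the auth middleware are all invented for illustration):

```python
# A hypothetical endpoint with the CWE-862 shape. The handler is clean:
# parameterized query, proper status codes, authentication enforced.
# Every structural check passes. The defect is a sentence nobody wrote
# down: "only administrators can perform deletions."
import sqlite3
from flask import Flask, g, jsonify

app = Flask(__name__)


def current_user():
    """Stand-in for session/token auth, set by middleware elsewhere."""
    return getattr(g, "user", None)


@app.route("/users/<int:user_id>", methods=["DELETE"])
def delete_user(user_id: int):
    if current_user() is None:  # authentication check: present
        return jsonify(error="unauthorized"), 401
    # MISSING: the authorization check that was never specified:
    # if not current_user().is_admin: return jsonify(error="forbidden"), 403
    db = sqlite3.connect("app.db")
    db.execute("DELETE FROM users WHERE id = ?", (user_id,))  # safe query
    db.commit()
    return jsonify(deleted=user_id), 200
```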
That's not a coding error! It's a missing requirement.
That's McGraw's point: About half of all security defects aren't implementation bugs at all. They're design flaws, places where the system's architecture doesn't enforce the security properties it was supposed to enforce. A cross-site scripting vulnerability isn't always a failure to sanitize input. Sometimes it's a failure to define which inputs are trusted and which aren't. A privilege escalation isn't always a broken access check. Sometimes there was never an access check to begin with, because nobody specified that one was needed. These are intent violations, and they're invisible to any tool that doesn't know what the software is supposed to prevent.
AI code review tools today are very good at catching the implementation half of McGraw's split. They can spot a SQL injection pattern, flag an unsafe deserialization, identify a buffer overflow. But they're working on the same side of the 50/50 line that static analysis has always worked on. The design half (the missing authorization checks, the unspecified trust boundaries, the security properties that were never written down) requires the same thing that catching my bus tracker bug required: knowing what the software was supposed to do in the first place.
How the Quality Playbook derives requirements (and how you can too!)
The problem most projects face is that they don't have formal requirements. What they have is code, documentation, commit messages, chat history, README files, and maybe some design docs. The question is how to get from that mess to a specification that an AI can actually use for verification.
The key insight I had while building the playbook was that every previous approach I tried asked the model to do two things at once: figure out what contracts exist AND write requirements for them. That doesn't work; the model runs out of attention trying to hold the entire behavioral surface in its head while also producing formatted requirements. So I split them apart into four steps: First, have the AI read each source file and write down every behavioral contract it observes as a simple list. Second, derive requirements from those contracts plus the documentation. Third, check whether every contract is covered by a requirement. Fourth, assert completeness, and if there are gaps, go back to step one for the files with gaps.
The key idea is that the contracts file is external memory. When the model "forgets" about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap.
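Here's a minimal sketch of that coverage check (step three). It's not the playbook's actual code; the file names and the "covers:" tag format are assumptions I'm making for illustration.

```python
# Minimal sketch of the coverage check. Assumes two plain-text files:
# contracts.txt (one observed contract per line, written in step one)
# and requirements.txt, where each requirement carries a line like
# "covers: <contract>". Both formats are invented for clarity.
from pathlib import Path


def uncovered_contracts(contracts_file: str, requirements_file: str) -> list[str]:
    contracts = {
        line.strip()
        for line in Path(contracts_file).read_text().splitlines()
        if line.strip()
    }
    covered = {
        line.split("covers:", 1)[1].strip()
        for line in Path(requirements_file).read_text().splitlines()
        if "covers:" in line
    }
    # Because every observation was written down *before* requirements
    # work began, a forgotten contract shows up here as a visible,
    # greppable gap instead of silently vanishing from context.
    return sorted(contracts - covered)


if __name__ == "__main__":
    for gap in uncovered_contracts("contracts.txt", "requirements.txt"):
        print(f"UNCOVERED: {gap}")
```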
You don't need the Quality Playbook to do this; you can apply the same approach with any AI coding tool you're already using. Here's what I'd recommend:
- Write down what your software is supposed to guarantee. Not just what it does: what it's supposed to do, for whom, under what conditions. If you're practicing spec-driven development, you're already partway there. The next step is adding the why: Why does this behavior matter, who depends on it, what goes wrong if it fails? That's the difference between a spec and a requirement, and it's the difference between an AI that can build your code and an AI that can verify it.
- Feed the AI your intent, not just your code. The intent is already sitting in your chat history, your design discussions, your Slack threads, your support tickets. Every Claude export, every Gemini conversation, every Cowork transcript contains design intent that never made it into specs: why a function was written a certain way, what failure prompted an architectural decision, what tradeoffs were discussed before choosing an approach. The design intent that used to require a human to extract and document is now sitting in your chat logs. Your AI can read the transcripts and extract the why.
- Look for the negative requirements. What should your software not do? What states should be impossible? What data should never be exposed? These negative requirements are often the most valuable, because they define boundaries that structural review can't see. The missing authorization bug was a negative requirement: Non-admin users must not be able to delete other users' data. The Gson bug was a negative requirement: Duplicate keys must not be silently accepted when the first value is null. If you can articulate what your software must never do, you've given the AI something powerful to check against (see the test sketch after this list).
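Negative requirements translate naturally into tests, because each one asserts that a forbidden thing cannot happen. Here's a minimal, self-contained sketch using the duplicate-key rule; the authorization rule would follow the same shape, asserting a 403.

```python
# Negative requirements as tests: each asserts that a forbidden thing
# cannot happen. parse_strict is an inline copy of the illustrative
# duplicate-rejecting parser from earlier, so this file runs on its own.
import json

import pytest


def parse_strict(text: str) -> dict:
    def on_pairs(pairs):
        obj = {}
        for key, value in pairs:
            if key in obj:
                raise ValueError(f"duplicate key: {key}")
            obj[key] = value
        return obj
    return json.loads(text, object_pairs_hook=on_pairs)


def test_duplicate_keys_never_silently_accepted():
    # Negative requirement: "duplicate keys must not be silently
    # accepted, even when the first value is null." (The Gson case.)
    with pytest.raises(ValueError):
        parse_strict('{"a": null, "a": 1}')


def test_null_alone_is_still_legal():
    # The positive counterpart, so the negative rule doesn't overreach.
    assert parse_strict('{"a": null}') == {"a": None}
```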
In the next article, I'll talk about context management, the skill that actually determines whether your AI sessions produce good work or mediocre work. Everything I've described here depends on the AI having the right information at the right time, and it turns out that managing what the AI knows (and what it forgets) is an engineering discipline in its own right. I'll cover how I went from running 15 million tokens in a single prompt to splitting the playbook into independent phases with zero context carryover, and why that transition worked on the first try.
The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It's also available as part of awesome-copilot.
Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.
