The following article originally appeared on the Asimov’s Addendum Substack and is republished here with the author’s permission.
Are LLMs reliable?
LLMs have built up a reputation for being unreliable.1 Small changes in the input can lead to big changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it’s hard to tell whether a model is confident in its answer or whether it could just as easily have gone the other way.
It’s easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs the model can provide, by restricting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the logprobs (the probabilities of all possible choices for the next token) are hidden from developers, while advanced tools for ensuring reliability, like constrained decoding and prefilling, aren’t made available. All of these features are readily accessible with open weight models and are inherent to the way LLMs work.
Every decision model providers make about which tools and outputs to offer developers through their API is not just an architectural choice but also a policy decision. Model providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer outcomes.
The artificial limits on input
Modern LLMs are usually built around chat templates. Every input and output, excluding tool calls and system or developer messages, is framed as a conversation between a user and an assistant: instructions are given as user messages; responses are returned as assistant messages. This becomes especially evident when looking at how modern LLM APIs work. The completions API, an endpoint originally introduced by OpenAI and widely adopted across the industry (including by several open model providers like OpenRouter and Together AI), takes input in the form of user and assistant messages and outputs the next message.2
The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output as completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.
When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually can’t prefill a model’s response, meaning developers can’t force a model to begin a response with a certain sentence or paragraph.3 This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it’s inefficient and risky not to enforce it at the token level.4 And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers when only part of the answer is wrong.5
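With an open weight model, prefilling is simply a matter of how the prompt string is assembled: render the chat template with the assistant turn opened but not closed, and the model has no choice but to continue from the prefilled text. Here is a minimal sketch using a made-up ChatML-style template (the `<|role|>` and `<|end|>` tags are illustrative, not any specific model’s tokens):

```python
# Sketch of prefilling via an open weight model's chat template.
# The tag format below is illustrative; real models define their own
# special tokens, but the principle is identical.

def render_prompt(messages, prefill=""):
    """Render a chat as one string, optionally seeding the assistant turn."""
    out = []
    for msg in messages:
        out.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>\n")
    # Open the assistant turn but do NOT close it: the model continues
    # from exactly this point, so the prefill is enforced at the token
    # level rather than merely requested in the instructions.
    out.append(f"<|assistant|>\n{prefill}")
    return "".join(out)

prompt = render_prompt(
    [{"role": "user", "content": "Summarize the report in one sentence."}],
    prefill="Summary: ",
)
```

Everything the model generates now follows “Summary: ” by construction. The same trick enables partial regeneration: truncate a previous answer at the last good sentence, pass it as the prefill, and only the remainder is resampled.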
Another deficiency that’s particularly visible is how the model’s chain-of-thought reasoning is handled. Most large AI companies have made a habit of hiding models’ reasoning tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you have to rely on the model’s own reasoning and can’t reuse reasoning traces to regenerate the same message.
There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling would greatly increase the attack surface for prompt injections. One study found that prefill attacks work very well even against state-of-the-art open weight models. But in practice, the model isn’t the only line of defense against attackers. Many companies already run prompts through classification models to detect prompt injections, and the same kind of safeguard could also be used against prefill attack attempts.
Output with few controls
Prefilling isn’t the only casualty of a clean separation between input and output. Even within a message, there are levers available on a local open weight model that simply aren’t possible when using a standard API. This matters because these controls allow developers to preemptively validate outputs and ensure that responses follow a certain structure, both reducing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output doesn’t inherently have to be limited to JSON.6 The same technique, constrained decoding (restricting which tokens the model can produce at any given point), can be used for much more than that. It can be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without using certain letters, or even enforce valid chess moves at inference time. It’s a powerful feature that lets developers precisely define what output is acceptable and what isn’t, guaranteeing reliable output that meets the developer’s parameters.
The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs weren’t designed to give developers full control over output because not everyone needs or wants that complexity. But that’s not an argument against including these features; it’s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It’s not infeasible for them to add a third, more advanced endpoint.
A lack of visibility
Even the model output that third-party developers do get through the model’s API is often a watered-down version of what the model actually produces. LLMs don’t just generate one token at a time; at every step they output logprobs over the entire vocabulary. When using an API, however, Google only provides the top 20 most likely logprobs. OpenAI no longer provides any logprobs for GPT-5 models, while Anthropic has never provided any at all. This has real-world consequences for reliability. Log probabilities are one of the most useful signals a developer has for understanding model confidence. When a model assigns nearly equal probability to competing tokens, that uncertainty is itself valuable information. And even for those companies that provide the top 20 tokens, that’s often not enough to cover larger classification tasks.
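The kind of confidence check this enables is straightforward. Given the top logprobs for the first token of a classification answer, a developer can convert them back to probabilities and flag near-ties for review. A sketch with illustrative numbers (not any particular API’s response format):

```python
import math

# Illustrative top logprobs for the first token of a yes/no classifier's
# answer. Logprobs are natural logs of probabilities, so exp() recovers
# the probability mass assigned to each candidate token.
top_logprobs = {"Yes": -0.65, "No": -0.75, "Maybe": -4.2}

probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

best, runner_up = ranked[0], ranked[1]
margin = best[1] - runner_up[1]

# A near-tie between "Yes" and "No" is a signal in itself: the sampled
# answer tells you almost nothing, so route the case to human review.
needs_review = margin < 0.10
```

Without logprobs in the API response, this distinction between a confident “Yes” and a coin-flip “Yes” is simply invisible to the developer.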
When it comes to reasoning tokens, even less output information is available. Leading providers such as Anthropic,7 Google, and OpenAI8 only provide summarized thinking for their proprietary models. And OpenAI only offers even that once a valid government ID is supplied to OpenAI. This not only takes away the user’s ability to actually inspect how a model arrived at a certain answer, but it also limits the developer’s ability to diagnose why a query failed. When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky on the final token. A summary obscures some of that, providing only an approximation of what actually happened. This isn’t a problem with the model; the model is still producing its full reasoning trace. It’s a problem with what information is provided to the end developer.
The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information the API returns. It’s hard to distill from tokens you cannot see, and without logprobs, distillation takes longer and each example provides less information.9 And this risk is something AI companies need to consider carefully, since distillation is a powerful technique for imitating the abilities of strong models at a low price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a national security risk by many, still shot straight to the top of US app stores upon launch and is used by many researchers and scientists, largely because of its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model’s outputs could mean losing those developers to cheaper and more open alternatives.
Reliability requires control and visibility
The reliability problems of current LLMs don’t stem solely from the models themselves but also from the tooling that providers give developers. For local open weight models it’s usually possible to trade complexity for reliability. The entire reasoning trace is always accessible and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding can be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made around which features to restrict in APIs hurt developers and ultimately end users.
LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the highest-impact improvements, such as showing thinking output, allowing prefilling, or exposing logprobs, cost almost nothing, but would be a major step toward making LLMs more controllable, consistent, and reliable.
There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an AI safety concern.
Specifically, to take reliability seriously, model providers should improve their APIs by offering features that give developers more visibility into and control over model output. Reasoning should be provided in full at all times, with any safety violations handled the same way they would be handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like regex or formal grammars.10 Developers should be granted full control over “assistant” output: they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from creating a new, more advanced API. They’ve done it before. The decision to withhold these features is a policy choice, not a technical limitation.
Improving intelligence isn’t the only way to improve reliability and control, but it’s usually the only lever that gets pulled.
