What you’ll pay for AI agents will be wildly variable and unpredictable




ZDNET’s key takeaways 

  • AI’s cost in terms of tokens soars when using agents.
  • Agents are inconsistent and can’t predict their total token usage.
  • Users must demand price transparency and performance guarantees.

Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. The providers of AI, such as OpenAI, Google, and Anthropic, have price lists, but none of those listed prices tell users what the final bill will be to actually solve a problem. 

The result, according to a new study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring and unpredictable costs of agents.

The study, by lead author Longju Bai of Michigan and collaborators at Stanford University, All Hands AI, Google’s DeepMind unit, Microsoft, and MIT, titled “How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks,” is, according to the authors, “the first systematic study on AI Agent token consumption.”

The study was posted on the arXiv pre-print server.

The paper is also noteworthy for counting among its authors Erik Brynjolfsson, a prominent Stanford economist who has commented extensively on AI’s impact on productivity.

The top-level finding is that agents consume orders of magnitude more tokens than turn-by-turn, simple, prompt-based chats — think 3,500 times the number of tokens for an agent as for a round of prompts with ChatGPT. 

Also: AI agents are fast, loose, and out of control, MIT study finds

A token is the fundamental unit of information processed by an AI model. It could be a piece of a word, a whole word, or just a punctuation mark, depending on how a model chops data into pieces. 
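As a rough illustration of why token counts depend on how a model chops text into pieces, here is a toy tokenizer sketch. Real tokenizers, such as OpenAI's open-source tiktoken, use learned subword vocabularies, so actual counts will differ:

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Crude stand-in for a real BPE tokenizer: splits on words and punctuation.
    # Learned tokenizers break rarer words into subword pieces, e.g.
    # "unpredictable" might become ["un", "predict", "able"].
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Agents are inconsistent, and costs soar.")
print(tokens)       # each word and punctuation mark is one "token"
print(len(tokens))  # 8
```

Every one of these pieces is metered and billed, which is why long agentic sessions add up so quickly.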

You might expect agents to cost more in tokens, but the study reveals more alarming facts. Two different models can have wildly different token costs for the same task. And the same model can have different costs each time that it works on the same problem, using as many as twice the number of tokens on one occasion compared to another. 

The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task. 

“Agentic tasks are uniquely expensive,” they wrote, yet more tokens don’t necessarily improve results: “Simply scaling token usage may not lead to higher execution performance,” and, worse, “[AI] models systematically underestimate the tokens they need.”

The rising cost and the uncertainty of success are in no way accounted for in today’s price lists from OpenAI and others. The work suggests there is no easy fix to the matter. The best users can do is to set hard limits on agentic computer use, possibly causing agents to halt before completing tasks.
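A hard limit of that kind can be sketched as a simple budget guard wrapped around the agent loop. The step interface below is an assumption for illustration only; real frameworks report token usage in their own response metadata:

```python
TOKEN_BUDGET = 500_000  # hard cap chosen by the user, not by the model

def run_with_budget(task, step_fn, budget=TOKEN_BUDGET):
    """Run agent steps until the task completes or the budget is spent.

    step_fn(task) -> (done: bool, tokens_used: int) is an assumed
    interface; frameworks such as OpenHands expose usage differently.
    """
    spent = 0
    while True:
        done, used = step_fn(task)
        spent += used
        if done:
            return "completed", spent
        if spent >= budget:
            # Halting here may leave the task unfinished, as noted above;
            # that is the trade-off for a predictable bill.
            return "budget_exhausted", spent

# Demo: a fake step that burns 120k tokens per turn and never finishes.
status, spent = run_with_budget("fix bug", lambda t: (False, 120_000))
print(status, spent)  # budget_exhausted 600000
```

Note that the guard only detects overruns after the fact, one step at a time, which is why the final spend can overshoot the nominal cap by up to one step's worth of tokens.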

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The big picture is that users collectively will have to push back on OpenAI and the other vendors and demand some form of reliable cost estimation and guarantees of task performance. 

We reached out to OpenAI, Google, and Anthropic for comment.

Counting token costs 

To study costs, Bai and team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and collaborating institutions. They used OpenHands to build agents, which they then tested on the open-source coding benchmark test SWE-Bench. The SWE-Bench tasks are taken from actual GitHub issues. 

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

They first found the relative strengths of models. OpenAI’s ChatGPT 5 and 5.2 “achieve strong accuracy at low cost,” though they are not the most accurate. Anthropic’s Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google’s Gemini-3-Pro was somewhere in the middle. And the Kimi-K2 model from Chinese AI lab Moonshot may have the worst relative mix: the most tokens to achieve the lowest accuracy.

[Figure: token efficiency versus accuracy for various models. University of Michigan]

The authors suggested the difference in tokens is based on unique properties of how models are architected: “The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more expensive for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem.”

But the issue is not one of better or worse models because even the same model can take twice as many tokens to solve the same problem from one “run” of the task to the next. 

“The most expensive runs double the token and monetary cost of the least expensive runs,” they observed, “suggesting that the agent’s token consumption has large variances even when working on exactly the same problem.”

[Figure: maximum and minimum token use by various models. University of Michigan]

The lesson is that more tokens don’t necessarily get you better results: as the authors put it, simply scaling token usage may not lead to higher execution performance.

In fact, the authors found that results can actually get worse the longer an agent spends on a task. “Accuracy often peaks at intermediate cost and saturates at higher costs,” they observed. “Agent behavior becomes increasingly unstable on more complex tasks.”

Many models seem to search and search to solve a problem even when it’s fruitless. “Models lack a reliable mechanism to recognize when a task is unsolvable and stop early,” wrote Bai and team. “Instead, they continue exploring, retrying, and re-reading context, accumulating cost without progress.”

Unable to predict costs

Those factors make “token usage prediction and agent pricing a fundamentally challenging task,” wrote Bai and team. And, in fact, the bot itself cannot predict when asked to “introspect,” they found.

Bai and team asked each AI agent to predict its tokens using the prompt: “I’ve uploaded a python code repository in the directory example repo. You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description,” and then the problem description, such as, fixing a bug for a comparison function in code that fails.

What they found is that agents can roughly approximate how many tokens will be used, but their predictions tend to be too low.

“Models consistently underestimate the tokens they need,” wrote Bai and team. “The bias is especially pronounced for input tokens, whose predictions stay compressed even as real values grow into the millions.”

Watch those inputs

That last point, about input tokens, has special prominence in the report. Bai and team found that input tokens, meaning what the human user types plus what the agent retrieves via tools such as database searches, dominate the token cost. The other two types, output tokens generated by the model and cached tokens held in memory from prior stages, are far less demanding.

“Strikingly, input tokens, not output tokens, dominate the overall cost in agentic coding.”

The reason is that “agentic workflows accumulate the information from different sources and the same context gets fed into the models repeatedly.” As a result, there is a “dramatically higher input/output ratio” for agentic AI than for single-prompt or multi-prompt AI sessions with a bot.

And, drilling down even further, the most expensive input token factor is when the agent retrieves prior information from memory. “We find that cache reads dominate both raw token volume and dollar cost,” Bai and team wrote. “In every phase, cache-read input tokens are the largest category by a wide margin (Figure 8a), reflecting the cumulative reuse of prior context.”
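To see how cache reads can dominate the bill even at a steep per-token discount, consider a back-of-envelope calculation. The per-million-token prices and token counts below are made up for illustration; they are not taken from any vendor’s price list or from the paper:

```python
# Hypothetical per-million-token prices (illustrative only; check your
# provider's current price list, as these are NOT real vendor rates).
PRICE_PER_M = {"input": 3.00, "cached_input": 0.30, "output": 15.00}

def run_cost(tokens: dict[str, int]) -> float:
    """Dollar cost of one agentic run given token counts by category."""
    return sum(tokens[k] / 1_000_000 * PRICE_PER_M[k] for k in tokens)

# Made-up counts mirroring the paper's finding: cache-read input tokens
# dwarf fresh input and output in agentic coding runs.
run = {"input": 400_000, "cached_input": 9_000_000, "output": 60_000}
print(f"${run_cost(run):.2f}")  # $4.80
```

In this sketch the cached reads cost $2.70 of the $4.80 total, more than fresh input ($1.20) and output ($0.90) combined, even though cached tokens are priced at a tenth of fresh input. Sheer volume wins.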

There will be a reckoning 

Overall, the study results confirm my anecdotal experience with coding agents such as Replit and Lovable, where the meter was constantly running to use the underlying AI models, and I had no sense of what the total cost would be.

What can be done? The authors don’t have many suggestions. One proposal is that even if agents can’t predict the number of tokens, they can make some guesses at a high level, a “coarse-grained” estimate for token cost. “This suggests that agent-driven estimation can potentially support early budget alerts before launching expensive runs, improving cost transparency without overpromising precise token-level accuracy,” they wrote.

I can think of a few other sensible guidelines. 

Since input tokens are the biggest cost element, one should think carefully about what can be controlled at input. The size of prompts is one factor that drives input tokens higher. The context window used with an agent, wider or narrower, affects token count at input. And the number of tools called by the agent, such as databases, will bring lots more input tokens into play. 
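One blunt way to rein in input tokens is to trim the context fed back to the model on each turn, keeping the system prompt plus only the most recent messages that fit. The sketch below uses character counts as a crude proxy for tokens; swap in a real tokenizer for accuracy:

```python
def trim_context(messages: list[str], max_chars: int = 8_000) -> list[str]:
    """Keep the system prompt plus the most recent messages that fit.

    Characters stand in for tokens here; a real implementation would
    count tokens with the model's own tokenizer.
    """
    system, rest = messages[0], messages[1:]
    kept, total = [], len(system)
    for msg in reversed(rest):          # walk newest-first
        if total + len(msg) > max_chars:
            break
        kept.append(msg)
        total += len(msg)
    return [system] + list(reversed(kept))

# Demo: ten bulky tool outputs, of which only the newest two fit the cap.
history = ["SYSTEM: fix bugs"] + [f"tool output {i}: " + "x" * 3_000 for i in range(10)]
trimmed = trim_context(history)
print(len(trimmed))  # 3
```

The trade-off is the same as with any context pruning: dropped messages may contain information the agent later re-fetches via tools, which itself costs input tokens.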

Also: Can a newbie really vibe code an app? I tried Cursor and Replit to find out

There’s only so much you can do as a user, however. Something more will have to be done on an industry-wide basis. The problems outlined are clearly those of a young industry, and one where vendors will have to be pushed by users to change practices. 

The lack of transparency about what an agent will cost to complete a task is untenable for enterprises that need to plan software investments. The burden falls on the user to run agentic tasks experimentally, over and over, just to arrive at an average cost that can serve as a planning estimate.

And the lack of guarantees of success — even after the agent burns through tokens — is the most glaring problem. That means enterprises could waste vast amounts of money just burning through tokens.

Users collectively are going to have to push back on vendors such as OpenAI, Google, and Anthropic and demand price transparency and some form of guarantee that a task will be completed, or else the entire exercise of agentic AI may be dominated by cost overruns and failed implementations.

Such deep problems are probably already being encountered by early adopters. They may be content to pay such a high cost to be among the first to get an agentic edge. It’s not a situation, however, that can lead to stable, steady use of agentic AI.




