Researchers from MIT, Stanford, the University of Michigan, and Salesforce AI Research have published the first systematic study of token consumption in agentic coding tasks, finding that AI agents burn roughly 1,000 times more tokens than conventional code-reasoning or code-chat workloads, and that spending more tokens does not reliably produce better results.
The paper, "How Do AI Agents Spend Your Money?", analyzes agent trajectories from eight frontier LLMs — including GPT-5, Claude Sonnet 4.5, and Kimi-K2 — evaluated on SWE-bench Verified, the standard benchmark for software engineering agents. Input tokens, not output tokens, drive the overall cost: agents repeatedly re-ingest long context windows — including prior observations, tool outputs, and environment state — across planning and error-recovery loops, with code generation itself representing a minor fraction of total spend.
Token spend is also highly stochastic. Across runs on the same task, total consumption can vary by as much as 30x. That variance does not correlate with task outcome: accuracy peaks at intermediate token-cost levels and saturates — or declines — at higher costs. Retrying a failed agent run by simply allowing more tokens is not a reliable fix.
Model selection is the highest-leverage cost variable the paper surfaces. On identical task sets, Kimi-K2 and Claude Sonnet 4.5 each consumed, on average, more than 1.5 million tokens beyond what GPT-5 used. For teams running agents at scale — hundreds of parallel sessions, CI pipelines, or multi-step orchestration workflows — that gap compounds into significant infrastructure cost differences. The finding gives enterprise architecture teams a benchmark-grounded basis for model-selection decisions that go beyond accuracy alone.
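A back-of-envelope sketch shows how the gap compounds; only the 1.5-million-token figure comes from the paper, while the price and workload volume below are illustrative assumptions:

```python
# How a per-task token gap compounds at scale. The blended price and
# task volume are assumptions, not figures from the paper.

token_gap_per_task = 1_500_000      # reported average gap vs. GPT-5
blended_price = 3.00 / 1_000_000    # assumed $/token, illustrative
tasks_per_day = 500                 # e.g. CI runs + parallel sessions

daily_delta = token_gap_per_task * blended_price * tasks_per_day
print(f"${daily_delta:,.0f}/day, ${daily_delta * 30:,.0f}/month")
# -> $2,250/day, $67,500/month from model choice alone
```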
A second gap the paper exposes is between human-perceived task complexity and actual agent effort. Difficulty ratings assigned by human experts correlate only weakly with the number of tokens agents actually expend. This undermines the common practice of using perceived complexity as a proxy for capacity planning or rate-limit allocation. Static budget policies calibrated to task type will routinely misallocate resources in both directions.
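Teams can audit this proxy on their own workload with a simple rank-correlation check; the sketch below uses hypothetical ratings and token counts:

```python
# Auditing whether human difficulty ratings predict agent effort.
# Ratings and token counts here are hypothetical examples; a rank
# correlation near zero means ratings are a poor budgeting basis.
from scipy.stats import spearmanr

ratings = [1, 2, 2, 3, 3, 4, 5]          # expert difficulty, 1-5
tokens = [900_000, 150_000, 2_100_000,   # observed per-task spend
          300_000, 1_700_000, 400_000, 1_200_000]

rho, p_value = spearmanr(ratings, tokens)
print(f"Spearman rho={rho:.2f} (p={p_value:.2f})")
```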
The paper also examines whether frontier models can pre-estimate their own token budgets before execution — a capability useful for cost-aware scheduling and admission control. The answer is qualified: prediction correlations top out at 0.39, and models systematically underestimate actual consumption. Self-reported budget estimates cannot be trusted as scheduling inputs without a correction layer or empirical calibration.
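One plausible form for such a calibration layer is a regression on logged (predicted, actual) pairs; the sketch below is an assumed implementation, not a method from the paper:

```python
# Empirical correction layer for model self-estimates, assuming you
# log (predicted, actual) token pairs per run. Fitting in log space
# handles the heavy right tail of token consumption.
import numpy as np

def fit_correction(predicted, actual):
    """Least-squares fit of log(actual) ~ a * log(predicted) + b."""
    x, y = np.log(predicted), np.log(actual)
    a, b = np.polyfit(x, y, deg=1)
    return lambda p: float(np.exp(a * np.log(p) + b))

# Historical runs: self-estimates systematically undershoot actuals.
pred = np.array([50_000, 80_000, 120_000, 200_000])
act = np.array([180_000, 260_000, 500_000, 900_000])
correct = fit_correction(pred, act)
print(correct(100_000))  # calibrated budget for a new self-estimate
```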
The study's scope is limited to SWE-bench Verified, a coding-specific benchmark, and the results may not generalize to the retrieval-heavy, tool-calling, or multi-agent orchestration workloads common in enterprise settings. The authors frame prediction accuracy as an open problem and call for future work on cost-aware agent architectures.
For teams scaling agentic AI from pilot to production, the takeaway is blunt: the dominant cost driver is context re-ingestion, model choice creates a multi-million-token delta per task cohort, and neither task complexity ratings nor agent self-estimates are reliable inputs to budget governance. Measure trajectories empirically, or budget blindly.
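As a minimal sketch of what empirical measurement could look like, assuming a hypothetical JSONL log with one per-run record:

```python
# Budget governance from observed runs rather than task-type heuristics.
# The log format ({"total_tokens": ...} per line) is a hypothetical example.
import json
import statistics

def budget_from_logs(path, headroom_pct=95):
    """Derive a token budget from logged per-run totals."""
    with open(path) as f:
        runs = [json.loads(line)["total_tokens"] for line in f]
    cuts = statistics.quantiles(runs, n=100)
    return {
        "median": statistics.median(runs),
        "budget": cuts[headroom_pct - 1],  # e.g. 95th-percentile cap
        "spread": max(runs) / min(runs),   # paper reports up to ~30x
    }
```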