Researchers from MIT, Stanford, the University of Michigan, and Salesforce AI Research have published the first systematic study of token consumption in agentic coding tasks, finding that AI agents burn roughly 1,000 times more tokens than conventional code-reasoning or code-chat workloads, and that spending more tokens does not reliably produce better results.
The paper, "How Do AI Agents Spend Your Money?", analyzes agent trajectories from eight frontier LLMs — including GPT-5, Claude Sonnet 4.5, and Kimi-K2 — evaluated on SWE-bench Verified, the standard benchmark for software engineering agents. Input tokens, not output tokens, drive the overall cost: agents repeatedly re-ingest long context windows — including prior observations, tool outputs, and environment state — across planning and error-recovery loops, with code generation itself representing a minor fraction of total spend.
Token spend is also highly stochastic. Across runs on the same task, total consumption can vary by as much as 30x. That variance does not correlate with task outcome: accuracy peaks at intermediate token-cost levels and saturates — or declines — at higher costs. Retrying a failed agent run by simply allowing more tokens is not a reliable fix.
Model selection is the highest-leverage cost variable the paper surfaces. On identical task sets, Kimi-K2 and Claude Sonnet 4.5 each consumed, on average, more than 1.5 million tokens beyond what GPT-5 used. For teams running agents at scale — hundreds of parallel sessions, CI pipelines, or multi-step orchestration workflows — that gap compounds into significant infrastructure cost differences. The finding gives enterprise architecture teams a benchmark-grounded basis for model-selection decisions that go beyond accuracy alone.
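A back-of-envelope sketch shows how the gap compounds; only the 1.5-million-token figure comes from the paper, while the price and workload volume below are illustrative assumptions:

```python
# How a per-task token gap compounds at scale. The blended price and
# task volume are assumptions, not figures from the paper.

token_gap_per_task = 1_500_000      # reported average gap vs. GPT-5
blended_price = 3.00 / 1_000_000    # assumed $/token, illustrative
tasks_per_day = 500                 # e.g. CI runs + parallel sessions

daily_delta = token_gap_per_task * blended_price * tasks_per_day
print(f"${daily_delta:,.0f}/day, ${daily_delta * 30:,.0f}/month")
# -> $2,250/day, $67,500/month from model choice alone
```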
A second gap the paper exposes is between human-perceived task complexity and actual agent effort. Difficulty ratings assigned by human experts correlate only weakly with the number of tokens agents actually expend. This undermines the common practice of using perceived complexity as a proxy for capacity planning or rate-limit allocation. Static budget policies calibrated to task type will routinely misallocate resources in both directions.
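Teams can audit this proxy on their own workload with a simple rank-correlation check; the sketch below uses hypothetical ratings and token counts:

```python
# Auditing whether human difficulty ratings predict agent effort.
# Ratings and token counts here are hypothetical examples; a rank
# correlation near zero means ratings are a poor budgeting basis.
from scipy.stats import spearmanr

ratings = [1, 2, 2, 3, 3, 4, 5]          # expert difficulty, 1-5
tokens = [900_000, 150_000, 2_100_000,   # observed per-task spend
          300_000, 1_700_000, 400_000, 1_200_000]

rho, p_value = spearmanr(ratings, tokens)
print(f"Spearman rho={rho:.2f} (p={p_value:.2f})")
```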
The paper also examines whether frontier models can pre-estimate their own token budgets before execution — a capability useful for cost-aware scheduling and admission control. The answer is qualified: prediction correlations top out at 0.39, and models systematically underestimate actual consumption. Self-reported budget estimates cannot be trusted as scheduling inputs without a correction layer or empirical calibration.
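One plausible form for such a calibration layer is a regression on logged (predicted, actual) pairs; the sketch below is an assumed implementation, not a method from the paper:

```python
# Empirical correction layer for model self-estimates, assuming you
# log (predicted, actual) token pairs per run. Fitting in log space
# handles the heavy right tail of token consumption.
import numpy as np

def fit_correction(predicted, actual):
    """Least-squares fit of log(actual) ~ a * log(predicted) + b."""
    x, y = np.log(predicted), np.log(actual)
    a, b = np.polyfit(x, y, deg=1)
    return lambda p: float(np.exp(a * np.log(p) + b))

# Historical runs: self-estimates systematically undershoot actuals.
pred = np.array([50_000, 80_000, 120_000, 200_000])
act = np.array([180_000, 260_000, 500_000, 900_000])
correct = fit_correction(pred, act)
print(correct(100_000))  # calibrated budget for a new self-estimate
```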
The study's scope is limited to SWE-bench Verified, a coding-specific benchmark, and the results may not generalize to the retrieval-heavy, tool-calling, or multi-agent orchestration workloads common in enterprise settings. The authors frame prediction accuracy as an open problem and call for future work on cost-aware agent architectures.
For teams scaling agentic AI from pilot to production, the takeaway is blunt: the dominant cost driver is context re-ingestion, model choice creates a multi-million-token delta per task cohort, and neither task complexity ratings nor agent self-estimates are reliable inputs to budget governance. Measure trajectories empirically, or budget blindly.
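As a minimal sketch of what empirical measurement could look like, assuming a hypothetical JSONL log with one per-run record:

```python
# Budget governance from observed runs rather than task-type heuristics.
# The log format ({"total_tokens": ...} per line) is a hypothetical example.
import json
import statistics

def budget_from_logs(path, headroom_pct=95):
    """Derive a token budget from logged per-run totals."""
    with open(path) as f:
        runs = [json.loads(line)["total_tokens"] for line in f]
    cuts = statistics.quantiles(runs, n=100)
    return {
        "median": statistics.median(runs),
        "budget": cuts[headroom_pct - 1],  # e.g. 95th-percentile cap
        "spread": max(runs) / min(runs),   # paper reports up to ~30x
    }
```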