How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a large number of tokens, three questions naturally arise: (1) Where do AI agents spend their tokens? (2) Which models are more token-efficient? (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts aligns only weakly with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
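To make findings (2) and (5) concrete, the following is a minimal sketch of how such quantities might be computed from logged runs. It is not the authors' analysis code: the function names are hypothetical, the numbers are invented placeholders rather than results from the paper, and the abstract does not say which correlation coefficient was used, so Pearson correlation is assumed here.

```python
from statistics import mean


def run_to_run_ratio(token_counts):
    """Max/min ratio of total tokens across repeated runs of one task."""
    return max(token_counts) / min(token_counts)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences (assumed metric)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_signed_error(predicted, actual):
    """Average of (predicted - actual); negative means underestimation."""
    return mean(p - a for p, a in zip(predicted, actual))


# Invented placeholder data, for illustration only.
runs_same_task = [120_000, 450_000, 3_600_000]        # total tokens per run
predicted = [100_000, 200_000, 150_000, 400_000]      # model's own estimates
actual = [300_000, 250_000, 900_000, 1_200_000]       # measured token usage

variability = run_to_run_ratio(runs_same_task)
correlation = pearson(predicted, actual)
bias = mean_signed_error(predicted, actual)

print(f"run-to-run ratio: {variability:.1f}x")
print(f"prediction correlation: {correlation:.2f}")
print(f"mean signed error: {bias:,.0f} tokens")
```

A negative mean signed error corresponds to the systematic underestimation described in finding (5), while the max/min ratio is one simple way to express the run-to-run spread in finding (2).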

