The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require large numbers of tokens, three questions naturally arise: (1) Where do AI agents spend their tokens? (2) Which models are more token-efficient? (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5; (4) task difficulty as rated by human experts aligns only weakly with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.