The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require large numbers of tokens, three questions naturally arise: (1) Where do AI agents spend their tokens? (2) Which models are more token-efficient? (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate the models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5; (4) task difficulty as rated by human experts aligns only weakly with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to predict their own token usage accurately (correlations are weak to moderate, at most 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.