Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, and the resulting unpredictable costs and environmental impact hinder practical adoption. To address this, we analyze token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare the token distribution (input, output, reasoning) across these stages. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption, averaging 59.4% of all tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption, averaging 53.9%, which provides empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
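
The quantification step described above (mapping internal phases to development stages, then computing each stage's and each token type's share) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the phase names and the phase-to-stage mapping in `PHASE_TO_STAGE`, the record layout, and the toy numbers are all assumptions made for the example, and the three token counts are assumed to be disjoint (reasoning tokens are not also counted as output tokens).

```python
from collections import defaultdict

# Hypothetical mapping from framework-internal phase names to the six
# development stages named in the abstract. Both the phase names and the
# mapping are assumptions for illustration only.
PHASE_TO_STAGE = {
    "DemandAnalysis": "Design",
    "LanguageChoose": "Design",
    "Coding": "Coding",
    "CodeComplete": "Code Completion",
    "CodeReviewComment": "Code Review",
    "CodeReviewModification": "Code Review",
    "TestErrorSummary": "Testing",
    "TestModification": "Testing",
    "EnvironmentDoc": "Documentation",
    "Manual": "Documentation",
}

# Assumed to be disjoint counts per LLM call.
TOKEN_KINDS = ("input", "output", "reasoning")


def stage_token_shares(records):
    """Return {stage: percentage of all tokens} for one task's trace."""
    totals = defaultdict(int)
    for rec in records:
        stage = PHASE_TO_STAGE.get(rec["phase"], "Unmapped")
        totals[stage] += sum(rec[kind] for kind in TOKEN_KINDS)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {stage: 100.0 * n / grand_total for stage, n in totals.items()}


def kind_token_shares(records):
    """Return {token kind: percentage of all tokens} for one task's trace."""
    totals = {kind: sum(rec[kind] for rec in records) for kind in TOKEN_KINDS}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: 100.0 * n / grand_total for kind, n in totals.items()}


if __name__ == "__main__":
    # Toy trace with invented numbers, one record per LLM call.
    toy_trace = [
        {"phase": "Coding", "input": 100, "output": 50, "reasoning": 50},
        {"phase": "CodeReviewComment", "input": 300, "output": 50, "reasoning": 50},
        {"phase": "CodeReviewModification", "input": 150, "output": 30, "reasoning": 20},
    ]
    print(stage_token_shares(toy_trace))
    print(kind_token_shares(toy_trace))
```

Averaging the per-task percentages over all tasks would then give stage-level and token-type averages of the kind reported above; whether the averages are taken per task or over pooled tokens is a design choice that this sketch leaves open.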
