As AI agents are increasingly deployed as long-running systems, they must autonomously construct and continuously evolve customized software in order to interact with dynamic environments. Yet existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where a milestone is defined as a functionally cohesive development goal. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, two dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.
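To make the idea of a Milestone DAG concrete, the following is a minimal sketch of how milestones with dependencies could be represented and ordered for sequential evaluation. It is not the DeepCommit implementation: the milestone names, the `milestone_deps` mapping, and the `evolution_order` helper are all hypothetical and chosen only for illustration. It uses Python's standard `graphlib.TopologicalSorter`, which assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical milestones (illustrative names, not from the paper).
# Each key maps to the set of milestones that must be completed first.
milestone_deps = {
    "m1_core_parser": set(),
    "m2_cli": {"m1_core_parser"},
    "m3_config_loader": {"m1_core_parser"},
    "m4_plugin_api": {"m2_cli", "m3_config_loader"},
}


def evolution_order(deps):
    """Return one valid order in which the milestones can be attempted.

    TopologicalSorter raises graphlib.CycleError if the dependencies
    contain a cycle, i.e. if the input is not actually a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = evolution_order(milestone_deps)
position = {name: index for index, name in enumerate(order)}

# Check that every milestone appears after all of its prerequisites.
respects_deps = all(
    position[dep] < position[milestone]
    for milestone, prerequisites in milestone_deps.items()
    for dep in prerequisites
)
print(order, respects_deps)
```

In a continuous setting of the kind the abstract describes, an agent would presumably work through such an order on a single evolving codebase, so that a defect introduced at an early milestone can carry over into every milestone that depends on it.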