SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Software engineering (SE) is increasingly collaborative, with developers working together on shared, complex codebases. Effective collaboration in a shared environment requires participants, whether humans or AI agents, to stay on the same page as that environment evolves. When a collaborator's understanding diverges from the current state (what we term the out-of-sync challenge), the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE, derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent ≤ 3.33% to Claude-3.5-Sonnet ≥ 28.18%), their consistently low collaboration willingness (≤ 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.
