LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and must continually evolve to meet changing external requirements. Automatically resolving agent issues (i.e., bug reports or feature requests) is therefore a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AgentIssue-Bench, a reproducible benchmark comprising 50 agent issue resolution tasks, each with an executable environment and failure-triggering tests. We further evaluate state-of-the-art SE agents on AgentIssue-Bench and reveal their limited effectiveness (i.e., resolution rates of only 0.67%–4.67%). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://github.com/alfin06/AgentIssue-Bench.