Large language models (LLMs) and LLM-based agents have been applied to fix bugs automatically, demonstrating their capability to address software defects through development-environment interaction, iterative validation, and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among the top performers. In this paper, we examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing. We first assess each system's overall performance, noting instances solvable by all or none of the systems, and explore why some instances are uniquely solved by specific system types. We then compare fault-localization accuracy at the file and line levels and evaluate bug-reproduction capabilities, identifying instances solvable only through dynamic reproduction. Our analysis shows that further optimization is needed in both the underlying LLM and the design of the agentic flow to improve the effectiveness of agents in bug fixing.