The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and SWE-Bench Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base ranging from individual developers to large tech companies.