The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems on real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to the Lite leaderboard and 133 to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, also remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide efforts toward greater transparency and diversity in future benchmark-driven research.