Language models have outpaced our ability to evaluate them effectively, but studying the frontier of their capabilities is essential for their future development. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework comprising $2{,}294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere $4.8$% and $1.7$% of instances, respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards language models that are more practical, intelligent, and autonomous.