Large Language Models (LLMs) have greatly advanced code auto-completion systems, with the potential to substantially improve developer productivity. However, current benchmarks focus mainly on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Respectively, these tasks measure a system's ability to retrieve the most relevant code snippets from other files as cross-file context, to predict the next line of code given both cross-file and in-file context, and to handle complex tasks that require a combination of retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and to encourage continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.
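To make the relationship between the three tasks concrete, the following is a minimal Python sketch of a retrieve-then-complete flow. It is illustrative only: the token-overlap (Jaccard) retriever, the exact-match scoring function, the helper names (`retrieve`, `build_prompt`, `exact_match`), and the toy snippets are all assumptions made for this example and are not taken from the RepoBench code base or its data format.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, candidates: list[str], k: int = 1) -> list[str]:
    """Retrieval step: rank cross-file snippets by similarity to the in-file code."""
    return sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)[:k]


def build_prompt(cross_file: list[str], in_file: str) -> str:
    """Completion step input: retrieved snippets as comments, then the in-file code."""
    commented = [
        "# " + line for snippet in cross_file for line in snippet.splitlines()
    ]
    return "\n".join(commented) + "\n" + in_file


def exact_match(prediction: str, reference: str) -> bool:
    """Score a predicted next line against the reference, ignoring outer whitespace."""
    return prediction.strip() == reference.strip()


# Hypothetical cross-file candidates and in-file context.
candidates = [
    "def load_config(path):\n    return json.load(open(path))",
    "def send_email(to, body):\n    smtp.send(to, body)",
]
in_file = "from utils import load_config\n\nconfig = "

# Pipeline: retrieve first, then hand the combined context to a completion model.
top = retrieve(in_file, candidates, k=1)
prompt = build_prompt(top, in_file)
print(prompt)
```

In a real evaluation, `prompt` would be passed to a code model and its predicted next line compared with the ground truth; here the model call is deliberately left out so the sketch stays self-contained.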