The task of repository-level code completion is to continue writing unfinished code based on the broader context of the repository. For automated code completion tools, however, it is difficult to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address this challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and can generate code at various levels of granularity. Moreover, we propose a new benchmark, RepoEval, which consists of recent, high-quality real-world repositories covering line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder
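To make the iterative retrieval-generation idea concrete, the following is a minimal sketch and not the released implementation. It assumes a Jaccard similarity over whitespace tokens as the retriever, fixed-size line windows as the retrieval units, and a caller-supplied `generate` function standing in for the pre-trained code language model. All function names, the window size, and the prompt layout are hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Split on whitespace; a deliberately crude tokenizer (assumption)."""
    return set(text.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code fragments."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_windows(repo: Dict[str, str], size: int = 3) -> List[str]:
    """Cut every file into overlapping windows of up to `size` lines."""
    windows = []
    for source in repo.values():
        lines = source.splitlines()
        # Files shorter than `size` still yield one window.
        for start in range(max(len(lines) - size + 1, 1)):
            windows.append("\n".join(lines[start:start + size]))
    return windows


def retrieve(query: str, windows: List[str], k: int = 2) -> List[str]:
    """Return the k windows most similar to the query."""
    return sorted(windows, key=lambda w: jaccard(query, w), reverse=True)[:k]


def iterative_complete(
    unfinished: str,
    repo: Dict[str, str],
    generate: Callable[[str], str],
    iterations: int = 2,
) -> str:
    """Alternate retrieval and generation for a fixed number of rounds.

    The first round queries with the unfinished code alone. Each later
    round appends the previous completion to the query, so retrieval can
    be guided by what the model has already predicted.
    """
    windows = build_windows(repo)
    query = unfinished
    completion = ""
    for _ in range(iterations):
        context = retrieve(query, windows)
        # Prompt layout (retrieved snippets, then the unfinished code)
        # is an assumption of this sketch.
        prompt = "\n".join(context) + "\n" + unfinished
        completion = generate(prompt)
        query = unfinished + "\n" + completion
    return completion
```

In this sketch, `repo` maps file paths to file contents and `generate` can be any function from a prompt string to a completion string, so a real model client or a test stub can be plugged in without changing the loop.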