Repository-level code completion is the task of continuing unfinished code using the broader context of its repository. For automated code completion tools, however, it is difficult to exploit the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework that addresses this challenge. It streamlines repository-level code completion by combining a similarity-based retriever with a pre-trained code language model, which allows repository-level information to be used effectively and supports code generation at various levels of granularity. Furthermore, RepoCoder uses a novel iterative retrieval-generation paradigm that bridges the gap between the retrieval context and the intended completion target. We also propose a new benchmark, RepoEval, which consists of recent, high-quality real-world repositories and covers line, API invocation, and function body completion scenarios. We evaluate RepoCoder with various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves on the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. We further validate the effectiveness of RepoCoder through comprehensive analysis, providing insights for future research.
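The iterative retrieval-generation idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the framework's actual implementation: the function names (`tokenize`, `jaccard`, `build_windows`, `retrieve`, `iterative_complete`) are hypothetical, Jaccard overlap of identifier tokens stands in for the similarity-based retriever, the window and stride sizes are arbitrary, and `generate` is a placeholder for any pre-trained code language model. The key step is that each round's completion is appended to the retrieval query, so the next retrieval is guided by the intended target as well as the unfinished code.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Chunk = Tuple[str, str]  # (file path, code snippet)


def tokenize(text: str) -> Set[str]:
    """Split code into a set of identifier-like tokens (illustrative choice)."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_windows(files: Dict[str, str], window: int = 4, stride: int = 2) -> List[Chunk]:
    """Cut every file into overlapping line windows that serve as retrieval units."""
    chunks: List[Chunk] = []
    for path, source in files.items():
        lines = source.splitlines()
        for start in range(0, max(len(lines) - window + 1, 1), stride):
            chunks.append((path, "\n".join(lines[start:start + window])))
    return chunks


def retrieve(query: str, chunks: List[Chunk], top_k: int = 2) -> List[Chunk]:
    """Return the top_k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: jaccard(q, tokenize(c[1])), reverse=True)
    return ranked[:top_k]


def iterative_complete(
    unfinished: str,
    files: Dict[str, str],
    generate: Callable[[str], str],  # placeholder for a code language model
    iterations: int = 2,
) -> str:
    """Alternate retrieval and generation, feeding each completion back into the query."""
    chunks = build_windows(files)
    query = unfinished
    completion = ""
    for _ in range(iterations):
        context = retrieve(query, chunks)
        prompt = "".join(f"# From {path}:\n{code}\n\n" for path, code in context) + unfinished
        completion = generate(prompt)
        # The next query includes the model's guess at the completion target.
        query = unfinished + completion
    return completion
```

With `iterations=1` the sketch reduces to plain retrieval-augmented completion; additional iterations let a first, possibly imperfect, completion steer retrieval toward code that resembles the target.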