Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it remains challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, but the retrieved context is often insufficiently relevant to the completion targets. In this paper, we propose DraCo, a dataflow-guided retrieval augmentation approach for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever code completion is triggered, DraCo precisely retrieves relevant background knowledge from this graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and practical efficiency of DraCo, which improves code exact match by 3.43% and identifier F1-score by 3.27% on average compared to the state-of-the-art approach.
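To make the pipeline described in the abstract more concrete, the following is a minimal sketch of the general idea, not DraCo's actual implementation. It assumes a toy repository held in memory, treats only top-level classes and functions as code entities, and records a single kind of relation (a variable assigned from a call to an imported entity). The names `build_context_graph`, `retrieve`, and `build_prompt` are hypothetical, and the extended dataflow analysis in the paper covers far more relations than this.

```python
import ast
import re


def build_context_graph(repo):
    """repo maps a module name to its (parseable) source text.

    Returns (entities, edges): entity definitions and variable -> entity links.
    """
    entities = {}  # "module.name" -> source text of the definition
    edges = {}     # "module.variable" -> set of "module.name" it is built from
    trees = {mod: ast.parse(src) for mod, src in repo.items()}

    # Pass 1: collect top-level classes and functions as code entities.
    for mod, tree in trees.items():
        for node in tree.body:
            if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
                entities[f"{mod}.{node.name}"] = ast.get_source_segment(repo[mod], node)

    # Pass 2: resolve `from X import Y`, then link `v = Y(...)` to entity X.Y.
    for mod, tree in trees.items():
        imported = {}  # local alias -> qualified entity name
        for node in tree.body:
            if isinstance(node, ast.ImportFrom) and node.module in repo:
                for alias in node.names:
                    imported[alias.asname or alias.name] = f"{node.module}.{alias.name}"
        for node in ast.walk(tree):
            if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
                callee = node.value.func
                if isinstance(callee, ast.Name) and callee.id in imported:
                    for target in node.targets:
                        if isinstance(target, ast.Name):
                            key = f"{mod}.{target.id}"
                            edges.setdefault(key, set()).add(imported[callee.id])
    return entities, edges


def retrieve(mod, unfinished_code, entities, edges):
    """Return definitions linked to identifiers that appear in the unfinished code."""
    # A regex is used because the unfinished line is usually not valid Python.
    names = set(re.findall(r"[A-Za-z_]\w*", unfinished_code))
    hits = []
    for name in sorted(names):
        for dep in sorted(edges.get(f"{mod}.{name}", ())):
            if dep in entities and entities[dep] not in hits:
                hits.append(entities[dep])
    return hits


def build_prompt(mod, unfinished_code, entities, edges):
    """Prepend retrieved definitions as comments to the unfinished code."""
    background = retrieve(mod, unfinished_code, entities, edges)
    commented = "\n".join(
        "# " + line for snippet in background for line in snippet.splitlines()
    )
    return (commented + "\n" if commented else "") + unfinished_code


# Toy repository (an assumption for illustration): two modules in one dict.
repo = {
    "bank": "class Account:\n    def deposit(self, amount):\n        self.balance = amount\n",
    "main": "from bank import Account\nacct = Account()\n",
}
entities, edges = build_context_graph(repo)
prompt = build_prompt("main", "acct.", entities, edges)
print(prompt)
```

In this sketch the completion target `acct.` pulls in the definition of `Account` because `acct` was assigned from it, even though the two share little surface text. That is the intuition behind preferring dataflow relations over plain text similarity; how DraCo actually ranks and formats the retrieved context is not specified in this abstract.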