The performance of repository-level code completion depends on effectively leveraging both general and repository-specific knowledge. Although code LLMs perform impressively on general code completion tasks, they often perform less well on repository-level completion because they lack repository-specific knowledge. To address this problem, we propose GraphCoder, a retrieval-augmented code completion framework that combines LLMs' general code knowledge with repository-specific knowledge through a graph-based retrieval-generation process. In particular, GraphCoder captures the context of the completion target more accurately through a code context graph (CCG), which consists of control-flow, data-dependence, and control-dependence relations between code statements. This is a more structured representation of the completion target's context than the sequence-based context used in existing retrieval-augmented approaches. Based on the CCG, GraphCoder then employs a coarse-to-fine retrieval process to locate code snippets in the current repository whose context is similar to that of the completion target. Experimental results demonstrate both the effectiveness and the efficiency of GraphCoder: compared to baseline retrieval-augmented methods, GraphCoder achieves higher exact match (EM) on average, with increases of +6.06 in code match and +6.23 in identifier match, while using less time and space.
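The coarse-to-fine retrieval idea can be illustrated with a minimal sketch. The abstract does not specify the similarity measures GraphCoder uses, so everything below is an assumption made for illustration: the `ContextGraph` class is a toy stand-in for a CCG slice, the coarse step uses Jaccard overlap of identifier tokens, and the fine step uses Jaccard overlap of typed edges (`"cf"` for control flow, `"dd"` for data dependence, `"cd"` for control dependence). The names `retrieve`, `QUERY`, and `REPO` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple

TOKEN = re.compile(r"[A-Za-z_]\w*")


@dataclass(frozen=True)
class ContextGraph:
    """Toy stand-in for a CCG slice: statements plus typed edges (assumed layout)."""
    statements: Tuple[str, ...]
    # Each edge is (source index, target index, kind), kind in {"cf", "dd", "cd"}.
    edges: FrozenSet[Tuple[int, int, str]] = frozenset()


def token_set(graph: ContextGraph) -> Set[str]:
    """Identifier-like tokens across all statements (sequence-level view)."""
    return {t for s in graph.statements for t in TOKEN.findall(s)}


def edge_set(graph: ContextGraph) -> Set[Tuple[str, str, str]]:
    """Edges labelled by their endpoint statements and kind (structure-level view)."""
    return {
        (graph.statements[src].strip(), kind, graph.statements[dst].strip())
        for src, dst, kind in graph.edges
    }


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: ContextGraph, repo: Dict[str, ContextGraph],
             coarse_k: int = 2, top_n: int = 1) -> List[str]:
    # Coarse step (assumed): cheap token overlap prunes the candidate pool.
    shortlist = sorted(
        repo, key=lambda n: jaccard(token_set(query), token_set(repo[n])),
        reverse=True,
    )[:coarse_k]
    # Fine step (assumed): re-rank survivors by overlap of typed edges.
    ranked = sorted(
        shortlist, key=lambda n: jaccard(edge_set(query), edge_set(repo[n])),
        reverse=True,
    )
    return ranked[:top_n]


QUERY = ContextGraph(
    ("f = open(path)", "data = f.read()"),
    frozenset({(0, 1, "cf"), (0, 1, "dd")}),
)
REPO = {
    # Same tokens as the query, but no data dependence between the statements.
    "a.py": ContextGraph(("f = open(path)", "data = read(path)"),
                         frozenset({(0, 1, "cf")})),
    # One extra token, but the same dependence structure as the query.
    "b.py": ContextGraph(("f = open(path)", "data = f.read()", "f.close()"),
                         frozenset({(0, 1, "cf"), (0, 1, "dd"), (1, 2, "cf")})),
    # Unrelated tokens: dropped by the coarse step.
    "c.py": ContextGraph(("x = 1", "print(x)"),
                         frozenset({(0, 1, "cf"), (0, 1, "dd")})),
}

print(retrieve(QUERY, REPO, coarse_k=2, top_n=2))  # ['b.py', 'a.py']
```

In this toy setup, token overlap alone ranks `a.py` first, while the edge-based re-ranking prefers `b.py`, whose statements are linked by the same control-flow and data-dependence relations as the query. That contrast is the intuition behind using a structured context rather than a purely sequence-based one, though the actual scoring in GraphCoder may differ.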