Repository-level code completion remains challenging for existing code large language models (code LLMs) because of their limited understanding of repository-specific context and domain knowledge. Retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, but they suffer from two fundamental problems: misalignment between the query and the target code during retrieval, and the inability of existing retrieval methods to make effective use of inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement-learning-based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we use reinforcement learning to train AlignRetriever, a retriever that learns to leverage the inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs and observe an 18.1% improvement in exact match (EM) score over baselines on CrossCodeEval. The results show that our framework achieves superior performance and generalizes well across code LLMs and programming languages.
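The query enhancement idea described in the abstract can be illustrated with a minimal sketch. This is not the AlignCoder implementation: the function names (`build_enhanced_query`, `retrieve`, `stub_sampler`) are hypothetical, the candidate completions come from a hard-coded stub rather than a code LLM, and a simple Jaccard token overlap stands in for the learned AlignRetriever. The reinforcement learning stage is not shown.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_enhanced_query(
    unfinished_code: str,
    sample_completions: Callable[[str, int], List[str]],
    n: int = 2,
) -> str:
    """Append n candidate completions to the initial query.

    Assumption: plain concatenation; the abstract does not specify
    how the candidates are combined with the initial query.
    """
    candidates = sample_completions(unfinished_code, n)
    return "\n".join([unfinished_code, *candidates])


def retrieve(query: str, snippets: List[str], k: int = 1) -> List[str]:
    """Rank repository snippets by token overlap with the query.

    Assumption: a toy lexical scorer standing in for a trained retriever.
    """
    q = tokenize(query)
    ranked = sorted(snippets, key=lambda s: jaccard(q, tokenize(s)), reverse=True)
    return ranked[:k]


def stub_sampler(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for sampling n completions from a code LLM."""
    guesses = ["load_config('settings.yaml')", "read_file('settings.yaml')"]
    return guesses[:n]


# Toy repository: two cross-file snippets.
SNIPPETS = [
    "def render_template(name, ctx):\n    return engine.render(name, ctx)",
    "def load_config(path):\n    return parse_yaml(read_file(path))",
]

if __name__ == "__main__":
    initial_query = "cfg = "
    enhanced_query = build_enhanced_query(initial_query, stub_sampler)
    print(retrieve(initial_query, SNIPPETS))
    print(retrieve(enhanced_query, SNIPPETS))
```

In this toy setup the initial query `cfg = ` shares no tokens with either snippet, so the ranking is uninformative, while the enhanced query contains identifiers from the candidate completions and therefore overlaps with the `load_config` snippet. Even imperfect candidates can carry identifiers that appear in the target's cross-file context, which is the gap the enhanced query is meant to bridge.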