Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code during retrieval, and the inability of existing retrieval methods to effectively utilize inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning-based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage the inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score over baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and generalizes well across diverse code LLMs and programming languages.
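The query enhancement idea above can be illustrated with a minimal sketch. The abstract does not specify the prompt format or retrieval interface, so every name here (`generate_candidates`, `build_enhanced_query`, the draft-comment layout) is a hypothetical illustration, not AlignCoder's actual implementation: draft completions are sampled and appended to the original query so the retriever matches against text that resembles the target code, not just the unfinished prefix.

```python
# Hypothetical sketch of query enhancement for retrieval, assuming a
# callable code LLM. Names and formatting here are illustrative only.

def generate_candidates(model, prefix: str, k: int = 3) -> list[str]:
    """Sample k draft completions for the unfinished code.

    In practice this would call a code LLM with temperature sampling;
    the model is passed in as a plain callable here to keep the sketch
    self-contained.
    """
    return [model(prefix) for _ in range(k)]


def build_enhanced_query(prefix: str, candidates: list[str]) -> str:
    """Concatenate draft completions onto the original query.

    The enhanced query carries inference information (what the target
    code might look like), narrowing the gap between query and target.
    """
    draft_block = "\n".join(
        f"# draft {i}: {cand}" for i, cand in enumerate(candidates)
    )
    return f"{prefix}\n{draft_block}"


# Toy usage with a dummy "model" that returns a plausible next line.
dummy_model = lambda prefix: "return a + b"
candidates = generate_candidates(dummy_model, "def add(a, b):", k=2)
enhanced_query = build_enhanced_query("def add(a, b):", candidates)
print(enhanced_query)
```

Under this sketch, the enhanced query (prefix plus draft lines) would replace the bare prefix as input to a dense retriever such as AlignRetriever.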