aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

Large Language Models (LLMs) have shown promising results in repository-level code completion, which completes code based on the in-file and cross-file context of a repository. The cross-file context is typically lengthy and contains different types of information (e.g., relevant APIs and similar code). In this paper, we find that LLMs struggle to fully utilize the information in the cross-file context. We hypothesize that one root cause of this limitation is the misalignment between pre-training (i.e., relying on nearby context) and repo-level code completion (i.e., frequently attending to long-range cross-file context). To address this misalignment, we propose Code Long-context Alignment (COLA), a purely data-driven approach that explicitly teaches LLMs to focus on the cross-file context. Specifically, COLA constructs a large-scale repo-level code completion dataset, COLA-132K, in which each sample contains a long cross-file context (up to 128K tokens) and requires generating context-aware code (i.e., cross-file API invocations and code spans similar to the cross-file context). Through a two-stage training pipeline on COLA-132K, LLMs learn to find relevant information in the cross-file context, which aligns them with repo-level code completion. We apply COLA to multiple popular LLMs (e.g., aiXcoder-7B) and conduct extensive experiments on COLA-132K and a public benchmark, CrossCodeEval. Our experiments yield the following results. 1) Effectiveness. COLA substantially improves the performance of multiple LLMs in repo-level code completion. For example, it improves aiXcoder-7B by up to 19.7% in exact match. 2) Generalizability. The capability learned by COLA generalizes to new languages. 3) Enhanced Context Utilization Capability. We design two probing experiments, which show that COLA improves the capability of LLMs to utilize the information in the cross-file context.
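To make the task setup concrete, the following is a minimal sketch of how a repo-level completion prompt might be assembled from cross-file and in-file context, and how an exact-match score might be computed. It is not the COLA pipeline: the names `CrossFileSnippet`, `build_prompt`, and `exact_match_rate`, the character-based budget, the `# file:` header format, and the whitespace-stripped definition of exact match are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrossFileSnippet:
    """Hypothetical container for one piece of cross-file context."""
    path: str  # file the snippet was taken from
    code: str  # e.g., a relevant API definition or a similar code span


def build_prompt(snippets, in_file_prefix, budget_chars=2000):
    """Place cross-file snippets before the in-file prefix.

    Assumption: the budget is counted in characters (a real system would
    count tokens). The in-file prefix is always kept whole; cross-file
    snippets are added in order until the next one no longer fits.
    """
    remaining = budget_chars - len(in_file_prefix)
    kept = []
    for snippet in snippets:
        block = f"# file: {snippet.path}\n{snippet.code.rstrip()}\n"
        if len(block) > remaining:
            break
        kept.append(block)
        remaining -= len(block)
    return "".join(kept) + in_file_prefix


def exact_match(prediction, reference):
    """Assumed definition: equality after stripping surrounding whitespace."""
    return prediction.strip() == reference.strip()


def exact_match_rate(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    snippets = [
        CrossFileSnippet("utils/io.py", "def load(path):\n    return open(path).read()\n"),
    ]
    print(build_prompt(snippets, "from utils.io import load\ntext = "))
    print(exact_match_rate(["load('a.txt')\n", "read()"], ["load('a.txt')", "load()"]))
```

In this sketch the model would be expected to continue the prompt with a call to `load`, which is the kind of cross-file API invocation the abstract describes as context-aware code.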
