Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, and files with similar names), and therefore produce inaccurate code completions. This effect is more pronounced when these assistants are used on repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi ($\sim 73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}.