RepoFusion: Training Code Models to Understand Your Repository

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names), and so produce inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}.


