Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs increase the response time of code completion and decrease developers' productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. These strategies consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and train aiXcoder-7B on a total of 1.2 trillion unique tokens. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B on five popular code completion benchmarks and a new benchmark collected in this paper. The results show that aiXcoder-7B outperforms six recent LLMs of similar size and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights to help practitioners train the next generation of LLMs for code. aiXcoder-7B has been open-sourced and has gained significant attention. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.
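To make the Fill-In-the-Middle idea concrete, the sketch below contrasts a plain FIM sample (masking a random character span) with a structure-aware variant that masks only whole lines, so the masked span aligns with syntactic units. This is a minimal illustration, not the paper's SFIM algorithm: the sentinel token names and the line-level masking heuristic are assumptions for demonstration, whereas the actual SFIM selects spans based on the code's syntax tree.

```python
import random

# Illustrative sentinel tokens; the actual special tokens used by
# aiXcoder-7B are not specified in the abstract.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def random_fim(code: str, rng: random.Random) -> str:
    """Plain FIM: mask a random character span, ignoring code structure."""
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # Prefix-Suffix-Middle ordering: the model predicts the middle last.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

def structured_fim(code: str, rng: random.Random) -> str:
    """Structure-aware sketch: mask whole lines so the span boundaries
    fall on syntactic units instead of cutting tokens mid-word."""
    lines = code.splitlines(keepends=True)
    a = rng.randrange(len(lines))
    b = rng.randrange(a, len(lines)) + 1
    prefix = "".join(lines[:a])
    middle = "".join(lines[a:b])
    suffix = "".join(lines[b:])
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

In both cases the three segments reconstruct the original file exactly (prefix + middle + suffix), so training on such samples teaches the model to complete code given context on both sides of the cursor.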