Large Language Models (LLMs) specializing in code generation (often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. For many applications, such as code snippet retrieval from natural language queries or code explanation, it is also crucial that code LLMs possess strong natural language abilities alongside code generation. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs, and thorough prior studies on LLM pretraining strategies that mix code and natural language are lacking. In this work, we propose a pretraining strategy to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it consists of two training phases with appropriately adjusted code-to-language data ratios. The resulting model, Crystal, demonstrates remarkable capabilities in both domains: its natural language and coding performance is comparable to that of Llama 2 and Code Llama, respectively. Crystal also exhibits better data efficiency, using 1.4 trillion tokens compared to the more than 2 trillion tokens used by Llama 2 and Code Llama. We verify our pretraining strategy by analyzing the training process and observe consistent improvements on most benchmarks. We also adopted a typical application adaptation phase with a code-centric data mixture, only to find that it did not lead to enhanced performance or training efficiency, underlining the importance of a carefully designed data recipe. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, logs, and 136 checkpoints saved throughout training.