CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by neural scaling laws as a function of the number of model parameters and observations, while model performance is bounded above by the amount of available data and compute, both of which are costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder- and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, we explore the effect of a mixture distribution of programming and natural languages on model performance. We conduct a comprehensive series of empirical experiments on 1B LLMs, and distill the failures and successes of this exploration into four lessons. We provide a final recipe for training and release CodeGen2 models with 1B, 3.7B, 7B, and 16B parameters, along with the training framework, as open source: https://github.com/salesforce/CodeGen2.
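
To make the prefix-LM idea in the abstract concrete, the following is a minimal sketch of the attention pattern a prefix-LM is generally understood to use: positions inside the prefix attend to each other bidirectionally (encoder-like), while the remaining positions attend causally (decoder-like). The function name `prefix_lm_mask` and its parameters are hypothetical and chosen for illustration only; this is not taken from the CodeGen2 code base.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper for illustration; not part of the CodeGen2 release.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Prefix tokens are visible to every position, which gives
            # bidirectional attention inside the prefix. All other tokens
            # are visible only to themselves and to later positions,
            # which is ordinary causal attention.
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 5 tokens with a 2-token prefix:
    # "x" marks an allowed attention edge, "." a blocked one.
    for row in prefix_lm_mask(seq_len=5, prefix_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` the mask reduces to a purely causal (decoder-only) pattern, and with `prefix_len=seq_len` it becomes fully bidirectional, which is the sense in which a single prefix-LM can cover both regimes.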

