Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be governed by neural scaling laws, as a function of the number of model parameters and observations, while model performance is bounded above by the amount of available data and compute, both of which are costly. In this study, we attempt to make the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder- and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, we explore the effect of a mixture distribution of programming and natural languages on model performance. We conduct a comprehensive series of empirical experiments on LLMs with 1B parameters, and distill the failures and successes of this exploration into four lessons. We will provide a final training recipe and release CodeGen2 models with 1B, 3.7B, 7B, and 16B parameters, along with the training framework, as open source: https://github.com/salesforce/CodeGen2.
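To make the prefix-LM idea concrete, the following is a minimal sketch of the attention mask such a model might use: positions inside the prefix attend to each other bidirectionally (encoder-like), while positions after the prefix attend causally (decoder-like). The function name `prefix_lm_mask` and its parameters are hypothetical and chosen for illustration only; they are not taken from the CodeGen2 code base, and the released implementation may differ.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Build an illustrative prefix-LM attention mask.

    mask[i][j] is True when query position i may attend to key position j.
    Hypothetical helper for illustration; not the CodeGen2 implementation.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must lie between 0 and seq_len")
    return [
        # Every position sees the whole prefix; beyond the prefix,
        # a position only sees itself and earlier positions.
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask for a 4-token sequence with a 2-token prefix.
    for row in prefix_lm_mask(seq_len=4, prefix_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, setting `prefix_len=0` yields a purely causal (decoder-style) mask and `prefix_len=seq_len` yields a fully bidirectional (encoder-style) mask, which is one way to see how a single prefix-LM can cover both regimes.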