Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
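Since perplexity is the quantity the findings above are indexed by, a minimal sketch of its standard definition may help: the exponential of the mean negative log-likelihood over tokens. The function name `perplexity` and the sample log-probabilities are illustrative assumptions, not taken from the paper, and the paper's actual evaluation setup may differ.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities.

    `token_log_probs` is a hypothetical input: one log-probability per token,
    as a language model would assign to the observed next token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values only: a model that assigns probability 0.25 to every
# observed token is as uncertain as a uniform choice among four options.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))  # approximately 4
```

Under this definition, two checkpoints of different sizes can be compared at matched perplexity rather than at matched training step, which is the kind of comparison the abstract describes.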