The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs, where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and sheds light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
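To make the regularizer under study concrete, here is a minimal sketch of one gradient step with decoupled weight decay (the AdamW-style formulation commonly used in LLM pretraining): the decay term shrinks each weight toward zero independently of the loss gradient. This is an illustrative assumption, not the paper's actual training code; the function name and hyperparameter values are hypothetical.

```python
def sgd_step_decoupled_wd(weights, grads, lr=0.1, weight_decay=0.1):
    """One SGD update with decoupled weight decay:
    w <- w - lr * g - lr * weight_decay * w.
    The weight_decay term pulls weights toward zero regardless of the
    gradient, which is the regularization effect varied in this work."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

# A larger weight_decay shrinks the weights more aggressively per step,
# even where the loss gradient is zero (second coordinate below).
updated = sgd_step_decoupled_wd([1.0, -2.0], [0.5, 0.0])
print(updated)
```

Note that in this decoupled form the decay is applied directly to the weights rather than being folded into the gradient, which is what distinguishes AdamW-style weight decay from classic L2 regularization when adaptive optimizers are used.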