Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream BLEU score with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.
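To make the log-law prediction concrete, the sketch below fits a curve of the form BLEU ≈ (log(A · D_p^α))^β to (pretraining size, BLEU) pairs and extrapolates to larger sizes. The abstract does not spell out the functional form, so this parameterization, the names `fit_log_law` and `predict`, and the grid over β are all assumptions made for illustration. The data points are synthetic, generated from the assumed law itself, and are not results from the paper.

```python
import math


def predict(d_p, log_a, alpha, beta):
    """Assumed log-law: BLEU = (log(A) + alpha * log(d_p)) ** beta, clipped at 0."""
    inner = log_a + alpha * math.log(d_p)
    return max(inner, 0.0) ** beta


def fit_log_law(d_p, bleu, betas=None):
    """Fit (log A, alpha, beta) by grid search over beta.

    For a fixed beta, bleu ** (1 / beta) = log(A) + alpha * log(d_p) is linear
    in log(d_p), so log(A) and alpha come from ordinary least squares.
    The beta with the smallest squared error in BLEU space is kept.
    """
    if betas is None:
        betas = [0.05 * k for k in range(1, 101)]  # hypothetical grid: 0.05 .. 5.0
    xs = [math.log(d) for d in d_p]
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)

    best = None
    for beta in betas:
        ys = [b ** (1.0 / beta) for b in bleu]
        y_mean = sum(ys) / n
        alpha = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        log_a = y_mean - alpha * x_mean
        err = sum((predict(d, log_a, alpha, beta) - b) ** 2
                  for d, b in zip(d_p, bleu))
        if best is None or err < best[0]:
            best = (err, log_a, alpha, beta)
    return best[1], best[2], best[3]


# Synthetic illustration only: points generated from the assumed law.
sizes = [1e9, 2e9, 5e9, 1e10, 2e10, 5e10, 1e11]  # pretraining tokens (made up)
scores = [predict(d, -5.0, 0.5, 1.7) for d in sizes]

# Fit on the five smallest sizes, then extrapolate to the two largest.
log_a, alpha, beta = fit_log_law(sizes[:5], scores[:5])
for d, s in zip(sizes[5:], scores[5:]):
    print(f"D_p={d:.0e}  predicted={predict(d, log_a, alpha, beta):.2f}  actual={s:.2f}")
```

A fit like this is only meaningful in the well-aligned regime described above, where BLEU improves monotonically with pretraining data; when BLEU fluctuates or degrades, a monotone curve of this kind should not be expected to fit.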