Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on scaling laws for the pretraining (upstream) loss. However, in transfer learning settings, where LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about downstream performance. In this work, we study scaling behavior in a transfer learning setting in which LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of pretraining data and its size affect downstream performance (translation quality), as measured by downstream cross-entropy and by translation quality metrics such as BLEU and COMET. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and the translation quality scores improve monotonically with more pretraining data. In such cases, we show that the downstream translation quality metrics can be predicted with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the downstream translation scores to fluctuate or even degrade with more pretraining, even as downstream cross-entropy continues to improve monotonically. By analyzing these cases, we provide new practical insights for choosing appropriate pretraining data.
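As an illustration of the kind of log-law referred to above, the minimal sketch below fits a function of the form f(D_p) = (log(A · D_p^α))^β to pairs of (pretraining tokens, BLEU score) and extrapolates it to a larger pretraining budget. The specific functional form, the fitting routine, and all numerical values here are assumptions made for illustration only, not the paper's reported law or results.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_law(d_p, A, alpha, beta):
    """Assumed log-law: f(D_p) = (log(A * D_p**alpha))**beta."""
    return np.log(A * np.power(d_p, alpha)) ** beta

# Illustrative, made-up measurements: pretraining tokens vs. BLEU after finetuning.
d_p = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
bleu = np.array([18.2, 21.5, 24.1, 26.0, 27.4])

# Fit the three coefficients; the bounds keep A * D_p**alpha >= 1 so the log stays non-negative.
(A, alpha, beta), _ = curve_fit(
    log_law, d_p, bleu,
    p0=[1.0, 0.3, 1.8],
    bounds=([1.0, 0.0, 0.1], [1e6, 1.0, 5.0]),
)

# Extrapolate the fitted law to a larger pretraining budget.
print(f"Predicted BLEU at 3e10 pretraining tokens: {log_law(3e10, A, alpha, beta):.1f}")
```

In a misaligned setting, as described above, such a fit would be unreliable, since the translation scores may fluctuate or degrade rather than follow a monotone law.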