Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
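For context, the disagreement concerns the power-law exponent relating compute-optimal model size to compute; the exponents below are the approximate values reported in the respective papers, quoted here only to illustrate the gap the abstract refers to:

$$
N_{\mathrm{opt}} \propto C^{\,0.73} \;\;\text{(Kaplan et al.)} \qquad \text{vs.} \qquad N_{\mathrm{opt}} \propto C^{\,0.50} \;\;\text{(Hoffmann et al.)},
$$

where $N_{\mathrm{opt}}$ is the compute-optimal parameter count and $C$ is the training compute budget.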