As language models scale up, it becomes increasingly expensive to verify research ideas, because conclusions drawn on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that predicts some metrics for large models directly, based solely on the results and hyperparameters of small models. Existing methods based on scaling laws require hyperparameter search on the largest models, which is impractical with limited resources. We address this issue by showing that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws for hyperparameters close to common loss basins, without any search. Thus, different models can be compared directly at large scale through loss prediction, even before training starts. We propose a new paradigm as a first step towards reliable academic research at any model scale without heavy computation. Code will be publicly available shortly.
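The fit-and-extrapolate step described in the abstract can be illustrated with a minimal sketch. It assumes a power-law form L(N) = a * N^(-b) + c, where N is the model size and c is an irreducible loss supplied by the user; the abstract does not state the functional form, so this choice, the function names `fit_power_law` and `predict_loss`, and all numbers below are hypothetical. The sketch does not implement muP itself; it only shows how losses measured on small models might be fitted and then extrapolated to a larger one.

```python
import math


def fit_power_law(sizes, losses, irreducible=0.0):
    """Fit L(N) = a * N**(-b) + irreducible by least squares in log-log space.

    Assumed functional form (not taken from the abstract). Every loss must
    be strictly greater than `irreducible` so the logarithm is defined.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss - irreducible) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least squares for log(L - c) = log(a) - b * log(N).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(size, a, b, irreducible=0.0):
    """Evaluate the fitted law at a (possibly much larger) model size."""
    return a * size ** (-b) + irreducible


# Synthetic, noise-free "small model" results generated from made-up
# parameters; these are placeholders, not measurements from the paper.
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [predict_loss(n, 20.0, 0.3, 1.8) for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses, irreducible=1.8)
print("fitted a, b:", a, b)
print("predicted loss at 1e9 parameters:", predict_loss(1e9, a, b, 1.8))
```

In practice the measured losses would be noisy and the irreducible term would itself need to be estimated, for example with a nonlinear least-squares routine; the fixed value here only keeps the example short.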