Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
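To make the Adam-atan2 idea concrete, here is a minimal sketch (not the paper's reference implementation) of a single optimizer step in which Adam's epsilon-guarded division m_hat / (sqrt(v_hat) + eps) is replaced by atan2(m_hat, sqrt(v_hat)). Because atan2 is unchanged when both arguments are scaled by the same positive factor and is bounded, the update cannot blow up or underflow as sqrt(v_hat) shrinks, so no epsilon hyperparameter is needed. The function name, hyperparameter defaults, and the omission of any extra scale factors below are our own illustrative choices.

```python
import numpy as np

def adam_atan2_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999):
    """One illustrative Adam-atan2 update for a single parameter array.

    Standard Adam uses m_hat / (sqrt(v_hat) + eps); here that division is
    replaced by arctan2(m_hat, sqrt(v_hat)), which lies in [-pi/2, pi/2]
    (since sqrt(v_hat) >= 0) and is invariant to rescaling m_hat and
    sqrt(v_hat) by a common positive factor, so it needs no epsilon.
    """
    m = beta1 * m + (1.0 - beta1) * grad            # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment EMA
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    update = np.arctan2(m_hat, np.sqrt(v_hat))      # replaces eps-guarded division
    return param - lr * update, m, v

# Toy usage: minimize the quadratic ||w - w_star||^2 for a few steps.
w_star = np.array([1.0, -0.5, 0.25])
w = np.zeros_like(w_star)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    g = 2.0 * (w - w_star)                          # gradient of the quadratic
    w, m, v = adam_atan2_step(w, g, m, v, t, lr=0.05)
```

The sketch is meant only to show the shape of the update: as the second-moment estimate approaches zero the step saturates at lr * pi/2 rather than dividing by a vanishing denominator, which is the numerical-stability property the abstract refers to; the paper's exact constants and implementation details may differ.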