Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For the Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant-norm condition is necessary but not sufficient: for each dataset size, multiple $(\eta, B)$ pairs reach the optimal norm, but only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a step toward a sufficient condition, we provide the first measurement of how $(\eta^{\ast}, B^{\ast})$ scales with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
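To make the central quantity concrete, here is a minimal sketch of measuring the operator norm of an output-layer weight matrix, assuming "operator norm" means the spectral norm (largest singular value) with respect to Euclidean norms; the paper's exact normalization convention (e.g. dimension-dependent rescaling used in norm-based optimizers) may differ, and the matrix below is purely illustrative.

```python
import numpy as np

def operator_norm(W: np.ndarray) -> float:
    """Spectral (operator) norm of W: its largest singular value."""
    # compute_uv=False returns only the singular values, sorted descending.
    return float(np.linalg.svd(W, compute_uv=False)[0])

# Hypothetical output-layer weights (vocab_size x hidden_dim), small for illustration.
rng = np.random.default_rng(0)
W_out = rng.normal(loc=0.0, scale=0.02, size=(256, 64))

norm = operator_norm(W_out)  # the invariant tracked across (eta, B) pairs
```

Under the norm-transfer claim, this scalar would be logged over training and compared across learning rate/batch size sweeps: the optimal pairs land on the same value.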