Second-order optimization has been developed to accelerate the training of deep neural networks, and it is being applied to increasingly large-scale models. In this study, with a view to training at even larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even when the network width increases significantly. Inspired by the maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters, including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer hyperparameters across models with different widths.