Accelerated training algorithms, such as adaptive learning rates and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers with adaptive learning rates may no longer perform effectively. This raises the need for alternative regularization approaches and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning: (1) we explain how preconditioning with AdaGrad, RMSProp, and Adam accelerates training; (2) we examine the interaction between regularization and preconditioning, outline the different options for selecting the variables to be regularized, and, in particular, discuss how to implement gradient regularization in this setting; and (3) we demonstrate how normalization methods accelerate training by improving Hessian conditioning, and discuss how this perspective can lead to new preconditioned training algorithms. Our findings offer a unified mathematical framework for understanding various acceleration techniques and for deriving appropriate regularization schemes.
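As a minimal sketch of points (1) and (2), the Python snippet below implements one Adam step with its diagonal preconditioner and contrasts two places to insert an L2 penalty: added to the gradient so it passes through the preconditioner (classic Adam with L2), or applied directly to the weights, bypassing the preconditioner (AdamW-style decoupled weight decay). The function name `adam_step`, the toy quadratic, and all hyperparameter values are illustrative assumptions, not the paper's notation or experiments.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update with two placements of the L2 regularization term.

    decoupled=False: the regularization gradient weight_decay * w is added
    to g and therefore passes through the diagonal preconditioner.
    decoupled=True: the decay is applied directly to w, bypassing the
    preconditioner (decoupled weight decay, as in AdamW).
    """
    if weight_decay and not decoupled:
        g = g + weight_decay * w          # regularizer goes through the preconditioner
    m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2    # second-moment estimate
    m_hat = m / (1 - beta1**t)            # bias corrections
    v_hat = v / (1 - beta2**t)
    # Diagonal preconditioner: per-coordinate step size 1 / (sqrt(v_hat) + eps)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        w = w - lr * weight_decay * w     # decay applied outside the preconditioner
    return w, m, v

# Usage on a toy ill-conditioned quadratic f(w) = 0.5 * w @ H @ w:
H = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    g = H @ w
    w, m, v = adam_step(w, g, m, v, t, lr=0.1, weight_decay=1e-2, decoupled=True)
```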
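To make point (3) concrete, here is a small hedged check (the data scales and seed are assumptions for illustration): for the least-squares loss f(w) = 0.5 * E[(w @ x - y)^2], the Hessian is the input second-moment matrix E[x x^T], and standardizing the features, roughly the effect a normalization layer has on the inputs of a linear layer, sharply reduces its condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
# Features on very different scales give an ill-conditioned Hessian E[x x^T].
X = rng.normal(size=(10_000, 2)) * np.array([100.0, 1.0])
H_raw = X.T @ X / len(X)

# Standardize each feature to zero mean and unit variance.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
H_norm = Xn.T @ Xn / len(Xn)

print(np.linalg.cond(H_raw))   # ~1e4: gradient descent limited by the worst scale
print(np.linalg.cond(H_norm))  # ~1: near-isotropic curvature, fast convergence
```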