Gradient regularization (GR) has been shown to improve the generalizability of trained models. While Natural Gradient Descent (NGD) has been shown to accelerate optimization in the initial phase of training, little attention has been paid to how the training dynamics of second-order optimizers can benefit from GR. In this work, we propose Gradient-Regularized Natural Gradients (GRNG), a family of scalable second-order optimizers that integrate explicit gradient regularization with natural gradient updates. Our framework introduces two frequentist algorithms: Regularized Explicit Natural Gradient (RENG), which uses double backpropagation to explicitly minimize the gradient norm, and Regularized Implicit Natural Gradient (RING), which incorporates regularization implicitly into the update direction. We also propose a Bayesian variant based on a Regularized-Kalman formulation that eliminates the need to invert the Fisher information matrix (FIM). We establish convergence guarantees for GRNG, showing that gradient regularization improves stability and enables convergence to global minima. Empirically, we demonstrate that GRNG consistently improves both optimization speed and generalization compared to first-order methods (SGD, AdamW) and second-order baselines (K-FAC, Sophia), with strong results on vision and language benchmarks.
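To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's RENG or RING algorithm (the abstract does not specify either). It adds a gradient-norm penalty to a damped natural-gradient step on a toy two-parameter linear regression. The function names, the penalty weight `lam`, the `damping` term, and the data are all assumptions made for illustration. Because the loss is quadratic, the gradient of the penalty is available in closed form as a Hessian-vector product; in a neural network this term is what double backpropagation would compute.

```python
# Illustrative sketch only: gradient-norm penalty + damped natural-gradient step
# on 2-parameter linear regression. Hypothetical names and hyperparameters;
# this is not the RENG/RING method from the paper.

def matvec(A, v):
    """Matrix-vector product for small dense lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve2(A, rhs):
    """Solve the 2x2 linear system A x = rhs by Cramer's rule."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def loss_grad(w, X, y):
    """Gradient of L(w) = 1/(2n) * sum_i (x_i . w - y_i)^2."""
    n = len(X)
    resid = [sum(xj * wj for xj, wj in zip(x, w)) - t for x, t in zip(X, y)]
    return [sum(r * x[j] for r, x in zip(resid, X)) / n for j in range(2)]

def fisher(X):
    """Empirical Fisher X^T X / n, assuming a unit-variance Gaussian likelihood.
    For this model it coincides with the Hessian of L."""
    n = len(X)
    return [[sum(x[i] * x[j] for x in X) / n for j in range(2)] for i in range(2)]

def regularized_grad(w, X, y, lam):
    """Gradient of L(w) + (lam / 2) * ||grad L(w)||^2.
    The penalty's gradient is H @ g, with H the Hessian (here equal to the Fisher)."""
    g = loss_grad(w, X, y)
    Hg = matvec(fisher(X), g)
    return [gi + lam * hgi for gi, hgi in zip(g, Hg)]

def regularized_natural_step(w, X, y, lr=0.5, lam=0.1, damping=1e-3):
    """One step w <- w - lr * (F + damping * I)^{-1} * regularized gradient."""
    F = fisher(X)
    F_damped = [[F[i][j] + (damping if i == j else 0.0) for j in range(2)]
                for i in range(2)]
    direction = solve2(F_damped, regularized_grad(w, X, y, lam))
    return [wi - lr * di for wi, di in zip(w, direction)]

# Toy data generated from w_true = [2, -1] with no noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]

w = [0.0, 0.0]
for _ in range(50):
    w = regularized_natural_step(w, X, y)

grad_norm = sum(g * g for g in loss_grad(w, X, y)) ** 0.5
print(w, grad_norm)
```

On this noiseless problem the iterate should approach the generating weights `[2, -1]` and the gradient norm should shrink toward zero. The toy says nothing about the generalization or speed claims in the abstract, which concern deep networks.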