Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to its algorithmic side, that is, to how GR should be computed so that it improves performance efficiently. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also yields better generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias toward the so-called rich regime and that the finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to other algorithms that explore flat minima through iterative ascent and descent steps. In particular, we reveal that the flooding method can perform finite-difference GR implicitly. Thus, this work broadens our understanding of GR in both practice and theory.
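As a rough illustration of the finite-difference idea, the sketch below assumes the common GR objective L(θ) + (γ/2)‖∇L(θ)‖², whose gradient is ∇L + γH∇L, and approximates the Hessian-vector product H∇L with a forward difference taken after a small gradient ascent step. The abstract does not state the exact objective, the difference scheme, or any hyperparameters, so the toy quadratic loss, the function names, and the values of `gamma`, `eps`, and `lr` are all hypothetical choices made for this example.

```python
import numpy as np


def loss_grad(theta, A):
    """Gradient of the toy loss L(theta) = 0.5 * theta^T A theta (assumed for illustration)."""
    return A @ theta


def fd_gr_grad(theta, A, gamma=0.1, eps=0.01):
    """Finite-difference estimate of the gradient of L + (gamma/2) * ||grad L||^2.

    Uses two gradient evaluations and no explicit Hessian:
    H @ g is approximated by (grad L(theta + eps * g) - g) / eps.
    """
    g = loss_grad(theta, A)
    theta_up = theta + eps * g          # gradient ascent step
    g_up = loss_grad(theta_up, A)       # gradient at the perturbed point
    return g + (gamma / eps) * (g_up - g)


def exact_gr_grad(theta, A, gamma=0.1):
    """Reference value using the explicit Hessian (H = A for the toy loss)."""
    g = loss_grad(theta, A)
    return g + gamma * (A @ g)


# Hypothetical toy problem: a symmetric positive-definite quadratic.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])

# One descent update using the finite-difference GR gradient.
lr = 0.1
theta_next = theta - lr * fd_gr_grad(theta, A)
```

For a quadratic loss the forward difference reproduces the Hessian-vector product up to rounding error, so `fd_gr_grad` and `exact_gr_grad` agree here. For a general loss the two differ by a term that depends on `eps`.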