Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

The training of over-parameterized neural networks has been studied extensively in recent literature. An important consideration is how to regularize these networks, given their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss and thereby lead to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, a naive implementation that adds noise to the weight matrices before backpropagation yields limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order term in the Taylor expansion. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm outperforms prior approaches to sharpness-reduced training, delivering up to a 2.4% increase in test accuracy when fine-tuning ResNets on six image classification datasets. Moreover, our approach reduces the trace of the Hessian by 15.8% and its largest eigenvalue by 9.7%. We also find that regularizing the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization when pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
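The two-point idea can be illustrated on a toy problem. The sketch below is not the paper's training algorithm. It assumes a small quadratic loss f(w) = 0.5 wᵀAw + bᵀw, whose Hessian is exactly A, and compares a one-point gap f(w + u) - f(w) with a two-point gap 0.5 (f(w + u) + f(w - u)) - f(w) under isotropic Gaussian noise u with standard deviation `sigma`. All function names, the matrix `A`, and the vectors `b` and `w` are hypothetical choices made for illustration only.

```python
import random


def loss(w, A, b):
    """Toy quadratic loss f(w) = 0.5 * w^T A w + b^T w; its Hessian is A."""
    n = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    lin = sum(bi * wi for bi, wi in zip(b, w))
    return 0.5 * quad + lin


def one_point_gap(w, u, A, b):
    """f(w + u) - f(w): still contains the first-order term <grad f(w), u>."""
    w_plus = [wi + ui for wi, ui in zip(w, u)]
    return loss(w_plus, A, b) - loss(w, A, b)


def two_point_gap(w, u, A, b):
    """0.5 * (f(w + u) + f(w - u)) - f(w): odd-order terms in u cancel."""
    w_plus = [wi + ui for wi, ui in zip(w, u)]
    w_minus = [wi - ui for wi, ui in zip(w, u)]
    return 0.5 * (loss(w_plus, A, b) + loss(w_minus, A, b)) - loss(w, A, b)


def estimate_trace(gap_fn, w, A, b, sigma, n_samples, seed=0):
    """Monte Carlo estimate of tr(Hessian) from noisy loss gaps.

    For u ~ N(0, sigma^2 I), E[u^T H u] = sigma^2 * tr(H), so each sample
    2 * gap / sigma^2 estimates the trace (exactly unbiased for a quadratic).
    Returns the sample mean and the sample variance of the per-draw estimates.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        u = [rng.gauss(0.0, sigma) for _ in w]
        samples.append(2.0 * gap_fn(w, u, A, b) / sigma ** 2)
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / (n_samples - 1)
    return mean, var


if __name__ == "__main__":
    A = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical Hessian, trace = 3
    b = [0.3, -0.7]
    w = [1.0, -2.0]
    for name, fn in [("one-point", one_point_gap), ("two-point", two_point_gap)]:
        mean, var = estimate_trace(fn, w, A, b, sigma=0.1, n_samples=20000)
        print(f"{name}: trace estimate = {mean:.3f}, per-sample variance = {var:.1f}")
```

Both gaps have the same expectation, but the one-point gap carries the zero-mean first-order term, whose variance scales with the squared gradient norm divided by `sigma` squared. In this toy setting the two-point gap removes that term entirely, which is the variance reduction the abstract describes. For a non-quadratic loss the estimate would only be approximately unbiased, since higher even-order terms remain.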
