We consider the problem of finding flat local minimizers by averaging over weight perturbations. Given a nonconvex function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ and a $d$-dimensional distribution $\mathcal{P}$ that is symmetric about zero, we perturb the weights of $f$ and define $F(W) = \mathbb{E}[f(W + U)]$, where $U$ is a random sample from $\mathcal{P}$. For small, isotropic Gaussian perturbations, this noise injection acts as a regularizer on the Hessian trace of $f$, so the weight-perturbed function is biased toward minimizers with low Hessian trace. Several prior works have studied settings related to this weight-perturbed function and designed algorithms to improve generalization. However, convergence rates for finding minima of the averaged function $F$ are not known. This paper considers an SGD-like algorithm that injects random noise before computing gradients and leverages the symmetry of $\mathcal{P}$ to reduce variance. We then provide a rigorous analysis, showing matching upper and lower bounds for our algorithm to find an approximate first-order stationary point of $F$ when the gradient of $f$ is Lipschitz-continuous. We empirically validate our algorithm on several image classification tasks with various architectures. Compared to sharpness-aware minimization, the minima found by our algorithm have a 12.6% lower Hessian trace and a 7.8% lower top Hessian eigenvalue, averaged over eight datasets. Ablation studies validate the benefit of each design choice in our algorithm.
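To make the noise-injection idea concrete, the following is a minimal NumPy sketch, not the paper's exact algorithm. It assumes $\mathcal{P} = \mathcal{N}(0, \sigma^2 I)$, for which a second-order Taylor expansion of a sufficiently smooth $f$ gives $F(W) = f(W) + \frac{\sigma^2}{2}\operatorname{Tr}(\nabla^2 f(W)) + O(\sigma^4)$. It also assumes one way of exploiting symmetry: averaging the gradients at $W + U$ and $W - U$, so that terms odd in $U$ cancel. The names `symmetric_perturbed_grad`, `perturbed_sgd`, and `toy_grad`, the toy objective, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def symmetric_perturbed_grad(grad_f, w, sigma, rng):
    """Two-point estimate of grad F(w), where F(w) = E[f(w + U)].

    Assumption: U ~ N(0, sigma^2 I). Because this distribution is symmetric
    about zero, averaging the gradients at w + U and w - U is still unbiased
    for grad F(w), and contributions that are odd in U cancel.
    """
    u = sigma * rng.standard_normal(w.shape)
    return 0.5 * (grad_f(w + u) + grad_f(w - u))


def perturbed_sgd(grad_f, w0, sigma=0.1, lr=0.05, steps=200, seed=0):
    """Plain gradient steps driven by the two-point perturbed gradient."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w -= lr * symmetric_perturbed_grad(grad_f, w, sigma, rng)
    return w


def toy_grad(w):
    """Gradient of the toy nonconvex objective f(w) = sum(w**4 - w**2)."""
    return 4.0 * w**3 - 2.0 * w


if __name__ == "__main__":
    w_final = perturbed_sgd(toy_grad, w0=np.array([1.0, -1.0]))
    print("final iterate:", w_final)
```

For a quadratic $f$ the gradient is linear, so the two perturbed gradients average to exactly $\nabla f(W)$ and the injected noise cancels completely. For general $f$ only the odd-order terms cancel, which is the intuition behind using symmetry to reduce variance.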