Sharpness-Aware Minimization (SAM) is best known for achieving state-of-the-art performance on natural image and language tasks. However, its most pronounced improvements (of tens of percent) occur in the presence of label noise. Understanding SAM's label-noise robustness requires a departure from characterizing the robustness of minima lying in "flatter" regions of the loss landscape. In particular, peak performance under label noise occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression, where SAM provably up-weights the gradient contribution from clean examples. Although this explicit up-weighting is also observable in neural networks, when we intervene and modify SAM to remove this effect, we surprisingly see no visible degradation in performance. We infer that SAM's effect in deeper networks is instead explained entirely by its effect on the network Jacobian. We theoretically derive the implicit regularization induced by this Jacobian effect in two-layer linear networks. Motivated by our analysis, we find that cheaper alternatives to SAM that explicitly induce these regularization effects largely recover its benefits in deep networks trained on real-world datasets.
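For readers unfamiliar with SAM, the update it applies is the standard two-step rule from the original SAM formulation: ascend along the normalized gradient to a perturbed point, then descend using the gradient evaluated there. The sketch below is a minimal NumPy illustration on the linear logistic regression setting mentioned above; the function names and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def sam_step(w, X, y, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights along the normalized
    gradient (sharpness ascent step), then take a descent step
    using the gradient evaluated at the perturbed point."""
    g = grad_fn(w, X, y)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction, radius rho
    g_sam = grad_fn(w + eps, X, y)               # gradient at perturbed weights
    return w - lr * g_sam

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss; labels y are in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid of the logits
    return X.T @ (p - y) / len(y)
```

In this linear setting the logit-term effect described above is the whole story: the perturbed gradient `g_sam` differs from `g` only through the changed logits, which is where the provable up-weighting of clean examples comes from.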