On the Maximum Hessian Eigenvalue and Generalization

From arXiv; Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, PMLR 187:51-65, 2023

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better to unseen data than "sharper" solutions, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, their generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, its generalization benefits also vanish at larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization even as they promote smaller $\lambda_{max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the discrepancy between full-batch gradient descent (GD) and minibatch SGD demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
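To make the central quantity concrete, the following is a minimal sketch of one common way to estimate $\lambda_{max}$: power iteration on Hessian-vector products. It is not taken from the paper. The function names (`hvp`, `lambda_max`), the finite-difference step `eps`, and the toy quadratic loss are all assumptions chosen for illustration. Note that power iteration converges to the eigenvalue of largest magnitude, which coincides with $\lambda_{max}$ only when the dominant eigenvalue is positive.

```python
import numpy as np


def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H(w) @ v.

    Uses a central difference of the gradient, so the Hessian is never formed.
    """
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


def lambda_max(grad_fn, w, num_iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        hv = hvp(grad_fn, w, v)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig


# Hypothetical toy loss L(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.diag([1.0, 4.0, 9.0])


def toy_grad(w):
    """Gradient of the toy quadratic loss."""
    return A @ w


estimate = lambda_max(toy_grad, np.zeros(3))
print(estimate)  # expected to be close to 9.0, the largest eigenvalue of A
```

For a real network, the gradient function would be supplied by an automatic differentiation library, and the Hessian-vector product would typically be computed exactly rather than by finite differences.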
