The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better to unseen data than "sharper" solutions, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, the generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, its generalization benefits also vanish at larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization even as they promote smaller $\lambda_{max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the discrepancy between full-batch gradient descent (GD) and minibatch SGD demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
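Since $\lambda_{max}$ is the central quantity here, a small illustration of how it can be estimated may help. The sketch below is not taken from the paper: it assumes a toy least-squares loss (whose Hessian is known in closed form, so the estimate can be checked) and uses power iteration on finite-difference Hessian-vector products. The names `hvp`, `lambda_max`, and `grad_fn` are hypothetical helpers introduced only for this example. Power iteration converges to the eigenvalue of largest magnitude, which coincides with $\lambda_{max}$ only when that eigenvalue is positive, as it is for this convex toy loss.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H(w) @ v using
    central differences of the gradient (no explicit Hessian needed)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def lambda_max(grad_fn, w, num_iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        hv = hvp(grad_fn, w, v)
        eig = float(v @ hv)          # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:              # v lies in the null space; nothing more to learn
            break
        v = hv / norm
    return eig

# Toy setup (an assumption for illustration): L(w) = ||X w - y||^2 / (2 n),
# whose gradient is X^T (X w - y) / n and whose Hessian is X^T X / n.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 5))
y = rng.standard_normal(64)

def grad_fn(w):
    return X.T @ (X @ w - y) / len(y)

w0 = np.zeros(5)
estimate = lambda_max(grad_fn, w0)
exact = np.linalg.eigvalsh(X.T @ X / len(y))[-1]  # eigvalsh returns ascending order
print(estimate, exact)
```

For a real network the gradient would come from automatic differentiation and the Hessian-vector product would typically be computed exactly rather than by finite differences; the toy version only shows why $\lambda_{max}$ can be measured without ever forming the full Hessian.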