Recently, sharpness-aware minimization (SAM) has emerged as a promising method to improve generalization by minimizing sharpness, which is known to correlate well with generalization ability. Since SAM was first proposed, many variants have been introduced to improve its accuracy and efficiency, but comparisons have largely been restricted to the i.i.d. setting. In this paper, we study SAM for out-of-distribution (OOD) generalization. First, we perform a comprehensive comparison of eight SAM variants on zero-shot OOD generalization, finding that the original SAM outperforms the Adam baseline by $4.76\%$ and the strongest SAM variants outperform the Adam baseline by $8.01\%$ on average. We then provide an OOD generalization bound in terms of sharpness for this setting. Next, we extend our study of SAM to the related setting of gradual domain adaptation (GDA), another form of OOD generalization in which intermediate domains are constructed between the source and target domains and iterative self-training on these intermediate domains is used to reduce the overall target-domain error. In this setting, our experimental results demonstrate that the original SAM outperforms the Adam baseline on each of the experimental datasets, by $0.82\%$ on average, and the strongest SAM variants outperform Adam by $1.52\%$ on average. We then provide a generalization bound for SAM in the GDA setting. Asymptotically, this bound is no better than the existing bound for self-training in the GDA literature. This highlights a further disconnect between SAM's theoretical justification and its empirical performance, consistent with recent work finding that low sharpness alone does not account for all of SAM's generalization benefits. For future work, we outline several potential avenues toward a tighter analysis of SAM in the OOD setting.
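For readers unfamiliar with SAM's mechanics, the core idea is a two-step update: first perturb the weights toward a nearby "worst-case" point within a radius $\rho$, then descend using the gradient evaluated at that perturbed point. The following is a minimal illustrative sketch on a parameter vector, not the paper's exact implementation; the function names, the toy quadratic objective, and the plain-SGD outer step are assumptions for illustration.

```python
import numpy as np

def sam_update(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM step (illustrative sketch, not the paper's exact code).

    grad_fn(w) returns the loss gradient at w.
    Step 1: ascend to the worst-case nearby point w + rho * g / ||g||.
    Step 2: take a descent step using the gradient evaluated there.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed point
    return w - lr * g_sharp                      # outer (SGD-style) step

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_update(w, lambda v: v)
# The iterates shrink toward the flat minimum at the origin.
```

In practice the outer step is taken by a base optimizer such as Adam (the baseline used in the paper's comparisons), and the SAM variants studied differ mainly in how the perturbation `eps` is computed and scaled.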