Exploring Flat Minima for Domain Generalization with Large Learning Rates

Domain Generalization (DG) aims to generalize to arbitrary unseen domains. A promising way to improve model generalization in DG is to seek flat minima. A typical method for this is SWAD, which averages weights along the training trajectory. However, the success of weight averaging depends on the diversity of the weights, which is limited when training with a small learning rate. We observe instead that a large learning rate can simultaneously promote weight diversity and facilitate the identification of flat regions in the loss landscape. A large learning rate, however, suffers from a convergence problem that cannot be resolved by simply averaging the training weights. To address this issue, we introduce a training strategy called Lookahead, which interpolates between fast and slow weights rather than averaging them. The fast weight explores the weight space with a large learning rate and does not itself converge, while the slow weight interpolates with it to ensure convergence. In addition, weight interpolation helps identify flat minima by implicitly optimizing the local entropy loss, which measures flatness. To further prevent overfitting during training, we propose two variants that regularize the training weight with either a weighted averaged weight or an accumulated history weight. Taking advantage of this new perspective, our methods achieve state-of-the-art performance on both classification and semantic segmentation domain generalization benchmarks. The code is available at https://github.com/koncle/DG-with-Large-LR.
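
The fast/slow weight interpolation described above can be sketched on a toy one-dimensional quadratic loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the loss, the learning rate of 2.1, the interpolation coefficient `alpha`, the inner step count `k`, and all function names are hypothetical choices made for the example. The learning rate is deliberately too large for plain gradient descent, so the failure shown here is outright divergence, a simpler stand-in for the convergence problem the abstract refers to.

```python
def grad(w, curvature=1.0):
    """Gradient of the toy loss 0.5 * curvature * w**2 (assumed for illustration)."""
    return curvature * w


def plain_gd(w0, lr, steps):
    """Plain gradient descent; with lr * curvature > 2 the iterate grows each step."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


def lookahead(w0, lr, alpha, k, outer_steps):
    """Lookahead-style update: k fast steps, then interpolate the slow weight."""
    slow = w0
    for _ in range(outer_steps):
        fast = slow  # the fast weight restarts from the slow weight
        for _ in range(k):
            fast = fast - lr * grad(fast)  # exploration with the large learning rate
        # Interpolation (not averaging): move the slow weight a fraction alpha
        # of the way toward the final fast weight.
        slow = slow + alpha * (fast - slow)
    return slow


# Hypothetical settings: lr = 2.1 makes each plain step multiply w by -1.1.
w_plain = plain_gd(1.0, lr=2.1, steps=100)
w_look = lookahead(1.0, lr=2.1, alpha=0.5, k=5, outer_steps=20)

print("plain GD  |w| =", abs(w_plain))
print("lookahead |w| =", abs(w_look))
```

In this toy setting each plain step scales the weight by -1.1, so the iterate moves away from the minimum at 0, while each outer Lookahead step scales the slow weight by roughly -0.305 and therefore contracts toward it. Whether interpolation stabilizes training in this way depends on `alpha`, `k`, and the learning rate; the values here were picked only so that the contrast is visible.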
