Modern optimizers such as AdamW, equipped with momentum and adaptive learning rates, are designed to escape local minima and explore the vast parameter space. This exploration is beneficial for finding good loss basins when training from scratch, but it is not necessarily ideal when resuming from a powerful foundation model: it can lead to large deviations from the pre-trained initialization and, consequently, worse robustness and generalization. At the same time, strong regularization on all parameters can lead to under-fitting. We hypothesize that selectively regularizing the parameter space is the key to fitting the downstream task while retaining pre-trained knowledge. This paper proposes a new weight decay technique, Selective Projection Decay (SPD), that selectively imposes a strong penalty on certain layers while allowing others to change freely. Intuitively, SPD expands the parameter search space for layers with consistent loss reduction and contracts it for layers with inconsistent loss reduction. Experimentally, when equipped with SPD, Adam consistently achieves better in-distribution generalization and out-of-distribution robustness on multiple popular vision and language benchmarks. Code is available at~\url{https://github.com/GT-RIPL/Selective-Projection-Decay.git}
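The core idea, decaying selected layers toward the pre-trained weights rather than toward zero, can be sketched in a few lines. The snippet below is a minimal, illustrative toy, not the paper's algorithm: the trigger used here (apply the penalty only when a gradient step increases the deviation from the pre-trained point) is a hypothetical stand-in for SPD's actual layer-selection criterion, and all names (`spd_step`, `decay`, `l2_dist`) are invented for this sketch.

```python
import math

def l2_dist(a, b):
    """Euclidean distance between two weight vectors (as lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spd_step(w, w0, grad, lr=0.1, decay=0.1):
    """One hypothetical SPD-style update for a single layer.

    Take a plain gradient step, then -- only if that step increased the
    deviation from the pre-trained weights w0 -- contract the new weights
    back toward w0. The "increased deviation" test is an illustrative
    stand-in for the paper's selection condition.
    """
    w_new = [wi - lr * gi for wi, gi in zip(w, grad)]
    if l2_dist(w_new, w0) > l2_dist(w, w0):
        # Projection decay: shrink toward the pre-trained point, not toward 0.
        w_new = [w0i + (1.0 - decay) * (wni - w0i)
                 for wni, w0i in zip(w_new, w0)]
    return w_new

# Toy "fine-tuning" problem: quadratic loss 0.5 * ||w - target||^2 whose
# minimizer lies away from the pre-trained initialization w0.
w0 = [0.0, 0.0, 0.0]
target = [1.0, -2.0, 0.5]
w_spd, w_plain = list(w0), list(w0)
for _ in range(200):
    w_spd = spd_step(w_spd, w0, [a - b for a, b in zip(w_spd, target)])
    w_plain = [a - 0.1 * (a - b) for a, b in zip(w_plain, target)]
```

On this toy problem, the plain gradient steps converge to `target`, while the SPD-style steps settle at an intermediate point that stays markedly closer to `w0`, which is the qualitative behavior the abstract describes: constrained deviation from the pre-trained initialization while still reducing the fine-tuning loss.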