Advances in the expressivity of pretrained models have increased interest in the design of adaptation protocols that enable safe and effective transfer learning. Going beyond conventional linear probing (LP) and fine-tuning (FT) strategies, protocols that can effectively control feature distortion, i.e., the failure to update features orthogonal to the in-distribution data, have been found to achieve improved out-of-distribution (OOD) generalization. To limit this distortion, the LP+FT protocol, which first learns a linear probe and then uses it as the initialization for subsequent FT, was proposed. However, in this paper, we find that when adaptation protocols (LP, FT, LP+FT) are also evaluated on a variety of safety objectives (e.g., calibration and robustness), a complementary perspective to feature distortion is helpful for explaining protocol behavior. To this end, we study the susceptibility of protocols to simplicity bias (SB), i.e., the well-known propensity of deep neural networks to rely upon simple features, as SB has recently been shown to underlie several problems in robust generalization. Using a synthetic dataset, we demonstrate the susceptibility of existing protocols to SB. Given the strong effectiveness of LP+FT, we then propose modified linear probes that help mitigate SB and lead to better initializations for subsequent FT. We verify the effectiveness of the proposed LP+FT variants for decreasing SB in a controlled setting, and their ability to improve OOD generalization and safety on three adaptation datasets.
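The two-stage structure of LP+FT described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy one-layer tanh "backbone", the synthetic data, the helper names (`forward`, `loss`, `train`), and all hyperparameters are assumptions chosen only for illustration, and the modified linear probes proposed in the paper are not shown. Stage 1 trains the head with the backbone frozen (LP); stage 2 starts from that head and updates both the backbone and the head (FT).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, v):
    H = np.tanh(X @ W)   # backbone features (hypothetical stand-in for a pretrained model)
    p = sigmoid(H @ v)   # linear head followed by a sigmoid
    return H, p

def loss(X, y, W, v):
    # Mean binary cross-entropy; eps avoids log(0).
    _, p = forward(X, W, v)
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def train(X, y, W, v, steps, lr, update_backbone):
    # Plain gradient descent; returns updated copies of W and v.
    W, v = W.copy(), v.copy()
    n = len(y)
    for _ in range(steps):
        H, p = forward(X, W, v)
        err = (p - y) / n                        # gradient of the loss w.r.t. the logits
        grad_v = H.T @ err
        if update_backbone:
            dH = np.outer(err, v) * (1 - H**2)   # backpropagate through tanh
            W -= lr * (X.T @ dH)
        v -= lr * grad_v
    return W, v

# Toy data and a random "pretrained" backbone (assumptions for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
W0 = rng.normal(scale=0.5, size=(5, 8))
v0 = np.zeros(8)

# Stage 1 (LP): backbone frozen, only the head is trained.
W_lp, v_lp = train(X, y, W0, v0, steps=200, lr=0.5, update_backbone=False)
# Stage 2 (FT): initialize from the probe, then update backbone and head together.
W_ft, v_ft = train(X, y, W_lp, v_lp, steps=200, lr=0.1, update_backbone=True)

print("loss at init:", loss(X, y, W0, v0))
print("loss after LP:", loss(X, y, W_lp, v_lp))
print("loss after LP+FT:", loss(X, y, W_ft, v_ft))
```

Plain FT corresponds to calling `train` with `update_backbone=True` directly from `v0`, which in this sketch is the only difference from LP+FT: the head initialization that the fine-tuning stage starts from.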