The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in their inputs, i.e., their plasticity. Defined as an average rate of change, plasticity captures sensitivity to input perturbations; in particular, high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing which components to prioritize during adaptation. A key takeaway for practitioners is that high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
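The abstract defines plasticity as an average rate of change of a component's output under input perturbation. A minimal sketch of one plausible finite-difference estimator is given below; the function name, the random-direction averaging scheme, and the perturbation scale `eps` are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def plasticity(f, inputs, eps=1e-3, n_dirs=8, seed=0):
    """Estimate the plasticity of a map f as the average rate of change
    ||f(x + eps*u) - f(x)|| / eps over random unit directions u.

    Hypothetical estimator for illustration; the paper's precise
    definition of the "average rate of change" may differ.
    """
    rng = np.random.default_rng(seed)
    rates = []
    for x in inputs:
        for _ in range(n_dirs):
            u = rng.standard_normal(x.shape)
            u /= np.linalg.norm(u)  # unit-norm perturbation direction
            rates.append(np.linalg.norm(f(x + eps * u) - f(x)) / eps)
    return float(np.mean(rates))
```

As a sanity check, a linear map `f(x) = 2x` changes its output at exactly twice the rate of any input perturbation, so its estimated plasticity is 2 regardless of `eps` or the sampled directions.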