The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in their inputs, that is, their \emph{plasticity}. Defined as an average rate of change, plasticity captures sensitivity to input perturbations; in particular, high plasticity implies low smoothness. Our theoretical analysis and extensive experiments (over $1{,}000$ finetuning runs on large-scale vision transformers) show that this perspective provides principled guidance for choosing which components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
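To make the notion of "average rate of change" concrete, the following minimal sketch estimates it empirically as the mean ratio $\|f(x) - f(y)\| / \|x - y\|$ over pairs of inputs. This is only one plausible reading of the phrase and is not taken from the paper: the function name `average_rate_of_change`, the Euclidean norm, the pairwise averaging, and the toy linear maps standing in for transformer components are all illustrative assumptions.

```python
import math
import random


def average_rate_of_change(f, inputs, eps=1e-12):
    """Mean of ||f(x) - f(y)|| / ||x - y|| over all distinct pairs of inputs.

    Hypothetical estimator for illustration; the paper's exact definition
    of plasticity may differ (norm, sampling scheme, normalization).
    """
    ratios = []
    for i in range(len(inputs)):
        for j in range(i + 1, len(inputs)):
            x, y = inputs[i], inputs[j]
            dx = math.dist(x, y)  # Euclidean distance between the two inputs
            if dx < eps:
                continue  # skip (near-)identical inputs to avoid dividing by zero
            dy = math.dist(f(x), f(y))  # distance between the two outputs
            ratios.append(dy / dx)
    if not ratios:
        raise ValueError("need at least two distinct inputs")
    return sum(ratios) / len(ratios)


def scale(factor):
    """Toy stand-in for a model component: multiplies every coordinate by `factor`."""
    return lambda x: [factor * v for v in x]


rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]

smooth = average_rate_of_change(scale(0.5), points)   # contracts distances
plastic = average_rate_of_change(scale(3.0), points)  # expands distances
print(f"smooth component: {smooth:.3f}, plastic component: {plastic:.3f}")
```

Under this reading, a component that expands distances between inputs scores higher than one that contracts them, which matches the statement that high plasticity implies low smoothness.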