The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in their inputs, that is, their plasticity. Defined as an average rate of change, plasticity captures sensitivity to input perturbation; in particular, high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance for choosing which components to prioritize during adaptation. A key takeaway for practitioners is that high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
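The notion of plasticity as an average rate of change can be illustrated with a minimal sketch. The helper below is hypothetical and not the authors' implementation: it estimates the mean ratio ||f(x + δ) − f(x)|| / ||δ|| over small random perturbations of a component's inputs.

```python
import numpy as np

def plasticity(f, inputs, eps=1e-3, n_dirs=8, seed=0):
    """Hypothetical estimator: average rate of change of f, i.e. the
    mean of ||f(x + delta) - f(x)|| / ||delta|| over random small
    perturbations delta of each input x."""
    rng = np.random.default_rng(seed)
    rates = []
    for x in inputs:
        for _ in range(n_dirs):
            delta = rng.normal(size=x.shape)
            delta *= eps / np.linalg.norm(delta)  # perturbation of norm eps
            rates.append(np.linalg.norm(f(x + delta) - f(x)) / eps)
    return float(np.mean(rates))

# Sanity check on a linear map: its rate of change in any direction lies
# between its smallest and largest singular values, so the estimate must too.
W = np.diag([3.0, 1.0, 0.5])
f = lambda x: W @ x
xs = [np.ones(3), np.array([1.0, -1.0, 2.0])]
print(plasticity(f, xs))
```

A smooth (small-Lipschitz) component yields a low value, while a highly input-sensitive component yields a high one, matching the abstract's statement that high plasticity implies low smoothness.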