Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown that Transformers are inherently low-pass filters that gradually oversmooth their inputs, reducing the expressivity of their representations. A natural question is: how can Transformers achieve these successes given this shortcoming? In this work we show that Transformers are not, in fact, inherently low-pass filters. Instead, whether a Transformer oversmooths depends on the eigenspectrum of its update equations. Our analysis extends prior work on oversmoothing and on the closely related phenomenon of rank collapse. We show that many successful Transformer models have attention and weights that satisfy conditions under which oversmoothing is avoided. Based on this analysis, we derive a simple way to parameterize the weights of the Transformer update equations that allows for control over their spectrum, ensuring that oversmoothing does not occur. Compared to a recent solution for oversmoothing, our approach improves generalization, even when training with more layers, fewer datapoints, and corrupted data.
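The dependence on the eigenspectrum can be illustrated with a toy experiment. The sketch below is not the parameterization derived in this work. It assumes a simplified linear residual update X <- X + A X W with a fixed, strictly positive row-stochastic attention matrix A and a symmetric weight matrix W built from hand-picked eigenvalues. The helper names (`symmetric_from_eigs`, `non_constant_share`, `run_layers`) and all numeric values are hypothetical choices made for illustration. In this toy setting the update has eigenvalues 1 + λ_A λ_W, so positive eigenvalues of W let the token-constant direction (λ_A = 1) dominate, while negative eigenvalues of W damp it.

```python
import numpy as np

def symmetric_from_eigs(eigs, rng):
    """Build a symmetric matrix with the given eigenvalues in a random orthogonal basis."""
    d = len(eigs)
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q @ np.diag(eigs) @ Q.T

def non_constant_share(X):
    """Fraction of X's norm that lies outside the token-mean (constant) component."""
    mean = X.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(X - mean) / np.linalg.norm(X))

def run_layers(A, W, X, num_layers):
    """Apply the toy update X <- X + A @ X @ W repeatedly (an assumed simplification)."""
    for _ in range(num_layers):
        X = X + A @ X @ W
        X = X / np.linalg.norm(X)  # rescale only, to avoid overflow or underflow
    return X

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 4

# Fixed softmax attention: strictly positive entries, each row sums to 1.
logits = rng.normal(size=(n_tokens, n_tokens))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

X0 = rng.normal(size=(n_tokens, d_model))

# Hypothetical spectra: all-positive versus all-negative eigenvalues for W.
W_pos = symmetric_from_eigs([0.2, 0.4, 0.6, 0.8], rng)
W_neg = symmetric_from_eigs([-0.2, -0.4, -0.6, -0.8], rng)

ratio_pos = non_constant_share(run_layers(A, W_pos, X0, 100))
ratio_neg = non_constant_share(run_layers(A, W_neg, X0, 100))

print("non-constant share, positive spectrum:", ratio_pos)
print("non-constant share, negative spectrum:", ratio_neg)
```

A share near zero means all token representations have collapsed to the same vector, which is oversmoothing in this toy model. A share that stays well above zero means token-to-token differences survive. Real Transformers have input-dependent attention, nonlinearities, and normalization, so this sketch only shows why the sign and size of the weight eigenvalues can matter.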