Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
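To make the measurement concrete: the Lie derivative of a map f under translation is d/dt [T_{-t} f(T_t x)] at t = 0, which is identically zero when f is translation-equivariant. The sketch below is a minimal illustration in plain NumPy on 1-D signals, using a central finite difference rather than the autodiff machinery a full implementation would use; all function names (`translate`, `translation_lie_derivative`) are illustrative, not the paper's API.

```python
import numpy as np

def translate(x, t):
    """Circularly shift a 1-D signal by t samples (t may be fractional)
    via a Fourier-domain phase shift."""
    freqs = np.fft.fftfreq(x.shape[-1])
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * freqs * t)))

def translation_lie_derivative(f, x, eps=1e-3):
    """Central finite-difference estimate of the Lie derivative
    d/dt [ T_{-t} f(T_t x) ] at t = 0; vanishes for an
    exactly translation-equivariant f."""
    plus = translate(f(translate(x, eps)), -eps)
    minus = translate(f(translate(x, -eps)), eps)
    return (plus - minus) / (2 * eps)

# An exactly shift-equivariant map: circular convolution with a fixed kernel.
kernel = np.zeros(64)
kernel[:3] = [0.25, 0.5, 0.25]
conv = lambda v: np.real(np.fft.ifft(np.fft.fft(v) * np.fft.fft(kernel)))

# A deliberately non-equivariant map: pointwise product with a fixed,
# position-dependent mask (its Lie derivative is -(dmask/dx) * input).
mask = np.sin(2 * np.pi * np.arange(64) / 64)
masked = lambda v: v * mask

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
print(np.abs(translation_lie_derivative(conv, x)).mean())    # near zero
print(np.abs(translation_lie_derivative(masked, x)).mean())  # clearly nonzero
```

Averaging the magnitude of this quantity over inputs and layers gives a single scalar measure of equivariance violation, which is what allows a like-for-like comparison across hundreds of architecturally different pretrained models.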