Whether vision Transformers or ConvNets make the better backbone for computer vision models has long been debated. Although the two are usually regarded as completely different architectures, in this paper we interpret vision Transformers as ConvNets with dynamic convolutions. This interpretation lets us characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. It can also guide network design, since researchers can now consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate this potential through two specific studies. First, we inspect the role of softmax as the activation function in vision Transformers and find that it can be replaced by commonly used ConvNet modules, such as ReLU and Layer Normalization, which results in faster convergence and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to these examples, and we hope it can inspire the community and give rise to more advanced network architectures.
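To make the dynamic-convolution reading concrete, the following is a minimal NumPy sketch and not the paper's implementation. It computes single-head self-attention twice: once as the usual matrix product, and once as an explicit weighted sum over all token positions, where the weights depend on the input (a "dynamic" kernel with a global receptive field). It also includes a variant in which softmax is swapped for ReLU followed by layer normalization. All function names (`dynamic_weights`, `attention`, `attention_as_convolution`) are hypothetical, and the ordering of ReLU and normalization, the normalization axis, and the absence of learnable scale and shift are assumptions, since the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, axis=-1, eps=1e-6):
    # Plain normalization without learnable scale/shift (an assumption).
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def dynamic_weights(x, w_q, w_k, activation="softmax"):
    """Input-dependent weights of shape (n_tokens, n_tokens)."""
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if activation == "softmax":
        return softmax(scores, axis=-1)
    if activation == "relu_ln":
        # Assumed ordering: ReLU first, then normalize over the key axis.
        return layer_norm(np.maximum(scores, 0.0), axis=-1)
    raise ValueError(f"unknown activation: {activation}")


def attention(x, w_q, w_k, w_v, activation="softmax"):
    # Standard formulation: one matrix product with the value projections.
    return dynamic_weights(x, w_q, w_k, activation) @ (x @ w_v)


def attention_as_convolution(x, w_q, w_k, w_v, activation="softmax"):
    # Same computation written as a convolution-style weighted sum:
    # output[i] = sum_j weights[i, j] * v[j], with weights computed from x.
    weights = dynamic_weights(x, w_q, w_k, activation)
    v = x @ w_v
    out = np.zeros_like(v)
    n_tokens = x.shape[0]
    for i in range(n_tokens):
        for j in range(n_tokens):
            out[i] += weights[i, j] * v[j]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))  # 6 tokens, 8 channels
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    for act in ("softmax", "relu_ln"):
        same = np.allclose(
            attention(x, w_q, w_k, w_v, act),
            attention_as_convolution(x, w_q, w_k, w_v, act),
        )
        print(act, "matches explicit weighted sum:", same)
```

The two formulations agree for either activation, which is the sense in which the choice of activation becomes a design decision within a shared framework. The sketch says nothing about convergence or accuracy; those claims rest on the paper's experiments.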