While models derived from Vision Transformers (ViTs) have been surging in performance, pre-trained models cannot seamlessly adapt to images of arbitrary resolution without altering the architecture and configuration, such as resampling the positional encodings, which limits their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose Multi-Head Self-Attention Convolution (MSA-Conv), which incorporates self-attention within generalized convolutions, including standard, dilated, and depthwise ones. MSA-Conv enables transformers to handle images of varying sizes without retraining or rescaling, and it further reduces the computational cost relative to the global attention in ViT, which grows rapidly as image size increases. Building on MSA-Conv, we present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification, in which two capacity-enhancing strategies, namely the Multi-Directional Cyclic Shifted Mechanism and the Inter-Pooling Mechanism, are proposed to establish long-distance connections between tokens and enlarge the effective receptive field. Extensive experiments validate the overall effectiveness of TiC, and ablation studies separately confirm the performance gains brought by MSA-Conv and the two capacity-enhancing strategies. Note that our proposal aims at studying an alternative to the global attention used in ViT, and MSA-Conv meets this goal by making TiC comparable to the state of the art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.
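To make the core idea concrete, the following is a minimal, illustrative sketch of attention computed within sliding convolution windows. It is an assumption-laden toy, not the paper's implementation: the abstract does not specify the multi-head layout, dilation handling, positional terms, or padding scheme, so this single-head, stride-1, zero-padded version (with hypothetical projection matrices `Wq`, `Wk`, `Wv`) only conveys how attention restricted to a $k\times k$ window keeps the operator resolution-agnostic.

```python
import numpy as np

def msa_conv_sketch(x, Wq, Wk, Wv, k=3):
    """Toy single-head self-attention over k x k sliding windows.

    Illustrative sketch only; MSA-Conv's actual multi-head, dilated,
    and depthwise variants are not detailed in the abstract. Assumes
    stride 1 and zero padding, so the output keeps the input resolution.

    x          : (H, W, C) feature map
    Wq, Wk, Wv : (C, C) projection matrices (hypothetical names)
    """
    H, W, C = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))  # zero-pad spatial dims
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            win = xp[i:i + k, j:j + k].reshape(-1, C)  # k*k window tokens
            q = x[i, j] @ Wq          # query from the center token
            keys = win @ Wk
            vals = win @ Wv
            logits = keys @ q / np.sqrt(C)
            a = np.exp(logits - logits.max())
            a /= a.sum()              # softmax over the local window
            out[i, j] = a @ vals      # attention-weighted aggregation
    return out
```

Because attention is confined to a fixed-size window, the same weights apply to any input resolution, e.g. `msa_conv_sketch(np.zeros((5, 7, 4)), *Ws)` and `msa_conv_sketch(np.zeros((9, 3, 4)), *Ws)` both run unchanged, unlike global attention whose cost scales with the full token count.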