Scattering Vision Transformer: Spectral Mixing Matters

Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in reducing attention complexity and in capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectral scattering network that enables the capture of intricate image details. SVT overcomes the non-invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a spectral gating network that uses Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs. SVT shows a 2\% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2\% top-1 accuracy, while SVT-H-B reaches 85.2\% (state-of-the-art for base versions) and SVT-H-L reaches 85.7\% (again state-of-the-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation, and it outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car. The project page is available at \url{https://badripatro.github.io/svt/}.
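
To make two of the claims above concrete, the following is a minimal NumPy sketch under stated assumptions, not the SVT implementation. A one-level Haar-style split stands in for the scattering network: it halves the spatial resolution like pooling does, but it keeps the high-frequency bands, so the input can be recovered exactly. The second function illustrates how an Einstein-summation (einsum) product over channel blocks can replace a dense channel-mixing matrix with fewer weights. The function names `haar_split`, `haar_merge`, and `blockwise_channel_mix`, and the block shapes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def haar_split(x):
    """Split a (H, W) array (H, W even) into one low- and three high-frequency bands.

    Assumption: a Haar-style transform is used here as a simple stand-in
    for a scattering transform; it is not the transform used by SVT.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 patch
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0  # scaled local average (what pooling keeps)
    high = (
        (a - b + c - d) / 2.0,  # differences across columns
        (a + b - c - d) / 2.0,  # differences across rows
        (a - b - c + d) / 2.0,  # diagonal differences
    )
    return low, high


def haar_merge(low, high):
    """Invert haar_split: rebuild the full-resolution array from the four bands."""
    h1, h2, h3 = high
    x = np.empty((2 * low.shape[0], 2 * low.shape[1]), dtype=low.dtype)
    x[0::2, 0::2] = (low + h1 + h2 + h3) / 2.0
    x[0::2, 1::2] = (low - h1 + h2 - h3) / 2.0
    x[1::2, 0::2] = (low + h1 - h2 - h3) / 2.0
    x[1::2, 1::2] = (low - h1 - h2 + h3) / 2.0
    return x


def blockwise_channel_mix(x, w):
    """Mix channels block by block with an einsum product.

    x: (N, C) token features; w: (Cb, Cd, Cd) with C = Cb * Cd.
    This uses Cb * Cd * Cd weights instead of the C * C weights of a dense
    mixing matrix. The block layout is an illustrative assumption.
    """
    n, c = x.shape
    cb, cd, _ = w.shape
    xb = x.reshape(n, cb, cd)
    yb = np.einsum("nbd,bdk->nbk", xb, w)  # independent Cd x Cd map per block
    return yb.reshape(n, c)


rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
low, high = haar_split(img)
restored = haar_merge(low, high)
# The split is invertible up to floating-point rounding.
print("reconstruction error:", np.abs(restored - img).max())

tokens = rng.standard_normal((16, 12))    # 16 tokens, 12 channels
weights = rng.standard_normal((3, 4, 4))  # 3 blocks of 4 channels each
mixed = blockwise_channel_mix(tokens, weights)
print("mixed shape:", mixed.shape, "weights used:", weights.size, "dense would use:", 12 * 12)
```

Average pooling would keep only a scaled copy of `low`; discarding the three `high` bands is what makes it non-invertible.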
