Transformers have achieved widespread success in computer vision. At their heart is the Self-Attention (SA) mechanism, an inductive bias that associates each input token with every other token through a weighted combination. Standard SA has complexity quadratic in the sequence length, which limits its usefulness for the long sequences that arise in high-resolution vision. Recently, inspired by operator learning for PDEs, Adaptive Fourier Neural Operators (AFNO) were introduced for high-resolution attention; they are based on global convolution, which is implemented efficiently via the FFT. However, AFNO's global filtering does not represent well the small- and moderate-scale structures that commonly appear in natural images. To exploit these coarse-to-fine structures, we introduce Multiscale Wavelet Attention (MWA), which builds on wavelet neural operators and incurs complexity linear in the sequence length. We replace the attention in ViT with MWA, and our experiments on CIFAR and ImageNet classification demonstrate significant improvement over alternative Fourier-based attentions such as AFNO and Global Filter Network (GFN).
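
To make the linear-cost, coarse-to-fine idea concrete, the following is a minimal NumPy sketch of multiscale token mixing with a wavelet transform. It is not the MWA layer itself: the abstract does not specify the wavelet family or how each scale is processed, so the Haar wavelet, the per-band channel scaling, and the names `haar_dwt`, `haar_idwt`, and `multiscale_mix` are all assumptions made for illustration.

```python
import numpy as np


def haar_dwt(x):
    """One level of the orthonormal Haar transform along axis 0.

    x has shape (n_tokens, channels) with n_tokens even.
    Returns (approx, detail), each of shape (n_tokens // 2, channels).
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # coarse (low-pass) band
    detail = (even - odd) / np.sqrt(2.0)  # fine (high-pass) band
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd tokens."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    x = np.empty((2 * approx.shape[0],) + approx.shape[1:], dtype=approx.dtype)
    x[0::2], x[1::2] = even, odd
    return x


def multiscale_mix(tokens, levels, weights):
    """Decompose tokens into `levels` detail bands plus one coarse band,
    scale each band per channel, and reconstruct.

    weights[k] (shape (channels,)) scales the detail band of level k;
    weights[levels] scales the coarsest approximation band.
    This per-band scaling is an illustrative stand-in, not the paper's operator.
    n_tokens must be divisible by 2 ** levels.
    """
    details = []
    approx = tokens
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    approx = approx * weights[levels]
    for lvl in reversed(range(levels)):
        approx = haar_idwt(approx, details[lvl] * weights[lvl])
    return approx


# Usage: 16 tokens, 4 channels, 3 scales; damp the finest details by half.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 4))
weights = [np.full(4, 0.5), np.ones(4), np.ones(4), np.ones(4)]
mixed = multiscale_mix(tokens, levels=3, weights=weights)
```

Each level touches half as many tokens as the one before it, so the total work is proportional to N + N/2 + N/4 + ..., which stays below 2N. That is where the linear cost in sequence length comes from in this sketch, in contrast to the N × N score matrix of standard self-attention.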