Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences, for example in language understanding. Building efficient, general-purpose visual backbones on SSMs is therefore a promising direction. However, Vision Mamba (ViM) methods are not yet fully competitive with traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs). To let SSMs process image data, ViMs typically flatten 2D images into 1D sequences, which inevitably ignores some 2D local dependencies and weakens the model's ability to interpret spatial relationships from a global perspective. We use the Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, so that ViM can model a unified visual representation in both the frequency and spatial domains. Introducing frequency-domain information gives ViM a global receptive field during scanning. We propose a novel model, Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it in Vim-F, which helps to fully exploit ViM's efficient long-sequence modeling capability. Finally, we redesign the patch embedding for Vim-F, using a convolutional stem to capture more local correlations and further improve the performance of Vim-F. Code is available at: \url{https://github.com/yws-wxs/Vim-F}.
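The frequency-domain step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the complex spectrum is combined with the real-valued feature map, so this sketch assumes the amplitude (magnitude) of the 2D FFT is added, scaled by a weight `alpha`. The names `add_spectrum`, `flatten_row_major`, and `alpha` are hypothetical and are not taken from the Vim-F repository.

```python
import numpy as np


def add_spectrum(feature_map, alpha=1.0):
    """Add the FFT amplitude spectrum to a real feature map of shape (C, H, W).

    Assumption: the magnitude of the spectrum is used and scaled by `alpha`;
    the abstract only states that the spectrum is added to the feature map.
    """
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))  # per-channel 2D FFT
    amplitude = np.abs(spectrum)                        # real-valued magnitude
    return feature_map + alpha * amplitude


def flatten_row_major(feature_map):
    """Flatten (C, H, W) into a token sequence of shape (H*W, C), row by row."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # toy feature map: 4 channels, 8x8 grid

fused = add_spectrum(x, alpha=0.5)   # spatial + frequency information
tokens = flatten_row_major(fused)    # 1D sequence an SSM could scan

# Changing a single spatial position shifts every frequency coefficient,
# which is why the spectrum carries global context.
x_perturbed = x.copy()
x_perturbed[0, 0, 0] += 1.0
spectrum_shift = np.fft.fft2(x_perturbed[0]) - np.fft.fft2(x[0])
```

In the row-major flattening, horizontally adjacent positions stay adjacent in the sequence, while vertically adjacent positions end up `W` tokens apart. This is the loss of 2D locality that the abstract refers to. Each entry of the spectrum, by contrast, depends on every spatial position.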

