Visual Mamba models have advanced rapidly in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computation at the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink both the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key components: auxiliary patch swapping, which encodes bidirectional information flow under a unidirectional scan, and batch folding with periodic state reset, which improves GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across model sizes. We will release the source code after publication.
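The batch-folding idea can be illustrated with a minimal sketch: a long token sequence is reshaped into several shorter sub-sequences that are scanned as a batch, with the recurrent state reset at each fold boundary so the folds are independent and can run in parallel. The toy linear recurrence, fold count, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def scan(x, a=0.9):
    """Sequential linear recurrence h_t = a*h_{t-1} + x_t
    (a toy stand-in for an SSM scan)."""
    h = np.zeros(x.shape[-1])
    out = []
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out.append(h.copy())
    return np.stack(out)

def batch_folded_scan(x, folds, a=0.9):
    """Fold a length-L sequence into `folds` sub-sequences and scan each
    from a fresh (reset) state. Because the folds share no state, a GPU
    can process them in parallel along the batch dimension."""
    L, d = x.shape
    assert L % folds == 0, "sequence length must divide evenly into folds"
    xf = x.reshape(folds, L // folds, d)  # (folds, L/folds, d)
    # Serial loop here for clarity; on hardware the folds run as one batch.
    return np.concatenate([scan(xf[f], a) for f in range(folds)], axis=0)

x = np.random.randn(16, 4)
y = batch_folded_scan(x, folds=4)  # four independent length-4 scans
```

Each fold's output matches a plain scan over that slice alone, which is exactly the effect of the periodic state reset: shorter effective recurrences traded for higher parallelism.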