Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability to long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks, as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherently causal nature of SSD/SSMs restricts their application to non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, a non-causal formulation of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which removes the dependence of token contributions on preceding tokens. Together with a multi-scan strategy, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at \url{https://github.com/YuHengsss/VSSD}.
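To make the causal-vs-non-causal distinction concrete, the following is a minimal toy sketch (our own illustrative simplification with scalar states, not the paper's exact SSD formulation): a standard causal SSM scan makes each output depend on preceding tokens through a decaying recurrence, whereas dropping the per-step decay magnitude and pooling every token's contribution into a shared hidden state yields outputs that no longer depend on token order.

```python
def causal_scan(x, a, b, c):
    """Causal SSM recurrence (toy scalar form): h_t = a_t*h_{t-1} + b_t*x_t,
    y_t = c_t*h_t. Output y_t depends on all tokens s <= t."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return ys

def noncausal_pool(x, b, c):
    """Non-causal variant (toy): the decay magnitude is discarded, so every
    token contributes b_t*x_t equally to one shared hidden state, and each
    output reads that state through its own c_t."""
    h = sum(b[t] * x[t] for t in range(len(x)))
    return [c[t] * h for t in range(len(x))]

x = [1.0, 2.0, 3.0]
print(causal_scan(x, a=[0.5] * 3, b=[1.0] * 3, c=[1.0] * 3))  # order-dependent
print(noncausal_pool(x, b=[1.0] * 3, c=[1.0] * 3))            # order-independent
```

In the pooled variant, permuting the input tokens (together with their b coefficients) permutes the outputs correspondingly but leaves each token's output value unchanged, which is the non-causal behavior suitable for vision data; the causal scan has no such invariance.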