Referring Image Segmentation (RIS) methods built on transformers have achieved great success in interpreting complex visual-language tasks. However, their quadratic computational cost makes capturing long-range visual-language dependencies resource-intensive. Mamba addresses this problem by processing sequences with linear complexity. Applying Mamba directly to multi-modal interaction is nonetheless challenging, primarily because its channel interactions are inadequate for effectively fusing multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction and fuses textual and visual features through its unique channel and spatial twisting mechanism. ReMamber achieves state-of-the-art results on three challenging benchmarks. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These analyses provide valuable perspectives for future research.