Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods fuse linguistic features into visual features to obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, because the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. We also design a Mask Decoder that enables explicit linguistic guidance, so that the segmentation is more consistent with the language expression. To this end, we propose a multi-modal query token that integrates linguistic information and interacts with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms state-of-the-art RIS methods. Our code will be publicly available.
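The bidirectional modelling described in the abstract can be illustrated with a minimal sketch. It is not the MARIS implementation: it assumes plain scaled dot-product cross-attention with no learned projections, and the names `cross_attend`, `mutual_attention`, `visual`, and `words` are hypothetical. How each direction maps onto the paper's Vision-Guided and Language-Guided branches is not specified in the abstract, so the sketch only shows the general idea of two parallel branches attending in opposite directions.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    # Scaled dot-product attention without learned projections (an assumption):
    # each query row is replaced by a weighted average of the context rows.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d)
                  for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(d)])
    return out

def mutual_attention(visual, words):
    # Two parallel branches attending in opposite directions.
    visual_out = cross_attend(visual, words)  # one row per visual token
    words_out = cross_attend(words, visual)   # one row per word token
    return visual_out, words_out

# Toy inputs: 3 visual tokens and 2 word tokens, feature dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
v_out, w_out = mutual_attention(visual, words)
```

In this toy setup each branch keeps the token count of its own modality while mixing in features from the other, which is the property a bidirectional fusion step needs before mask decoding.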