Referring image segmentation (RIS) aims to segment the particular region described by a language expression prompt. Existing methods incorporate linguistic features into visual features and use the resulting multi-modal features for mask decoding. However, these methods may segment the most visually salient entity instead of the region actually referred to, because the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. We also design a Mask Decoder that provides explicit linguistic guidance, so that the segmentation is more consistent with the language expression. To this end, we propose a multi-modal query token that integrates linguistic information and interacts with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms state-of-the-art RIS methods. Our code will be publicly available.
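
To make the idea of two parallel, bidirectional attention branches concrete, the following is a minimal NumPy sketch of generic bidirectional cross-attention between visual tokens and word tokens. It is an illustration under stated assumptions, not the MARIS implementation: the function name `mutual_aware_attention`, the single-head unprojected form, the shared affinity matrix, and the residual additions are all hypothetical choices. The abstract does not say which direction corresponds to Vision-Guided Attention and which to Language-Guided Attention, so the branches are named by what they compute.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mutual_aware_attention(visual, language):
    """Hypothetical sketch of two parallel cross-attention branches.

    visual:   array of shape (num_visual_tokens, dim)
    language: array of shape (num_words, dim)
    Returns updated visual and language features with the same shapes.
    Learned projections and multiple heads are omitted (an assumption
    made for brevity, not a detail taken from the paper).
    """
    dim = visual.shape[-1]

    # Scaled dot-product affinity between every visual token and every word.
    affinity = visual @ language.T / np.sqrt(dim)  # (num_visual, num_words)

    # Branch 1: each visual token attends over the words.
    visual_queries_language = softmax(affinity, axis=1) @ language

    # Branch 2: each word attends over the visual tokens.
    language_queries_visual = softmax(affinity.T, axis=1) @ visual

    # Residual fusion so each modality keeps its own information.
    fused_visual = visual + visual_queries_language
    fused_language = language + language_queries_visual
    return fused_visual, fused_language


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(6, 4))  # e.g. 6 image patches
    word_tokens = rng.normal(size=(3, 4))    # e.g. 3 words
    fused_v, fused_l = mutual_aware_attention(visual_tokens, word_tokens)
    print(fused_v.shape, fused_l.shape)
```

Because both branches reuse one affinity matrix, each modality is reweighted by the other rather than language simply being injected into visual features, which is the kind of imbalance the abstract attributes to earlier fusion schemes.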