Referring image segmentation (RIS) aims to segment the particular region described by a language expression. Existing methods incorporate linguistic features into visual features and use the resulting multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referred region, because the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder that enables explicit linguistic guidance, so that the segmentation is more consistent with the language expression. To this end, we propose a multi-modal query token that integrates linguistic information and interacts with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms state-of-the-art RIS methods. Our code will be publicly available.
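The bidirectional fusion described above can be illustrated with a minimal sketch of two parallel cross-attention branches. This is not the MARIS implementation: the function names (`cross_attention`, `mutual_aware_fusion`), the use of plain scaled dot-product attention without learned projections, and the feature shapes are all assumptions made for illustration, and the sketch does not claim which branch corresponds to Vision-Guided Attention or Language-Guided Attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention in which each query row attends over context rows.

    Hypothetical helper: no learned query/key/value projections are used,
    so `context` serves as both keys and values.
    """
    dim = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(dim)   # (n_queries, n_context)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ context, weights


def mutual_aware_fusion(visual, linguistic):
    """Run two parallel branches over the same pair of feature sets.

    visual:     (n_pixels, dim) flattened image features (assumed shape)
    linguistic: (n_words, dim)  word features (assumed shape)
    """
    # Branch A: every visual position gathers information from the words.
    lang_aware_visual, w_pix_to_word = cross_attention(visual, linguistic)
    # Branch B: every word gathers information from the visual positions.
    vis_aware_linguistic, w_word_to_pix = cross_attention(linguistic, visual)
    return lang_aware_visual, vis_aware_linguistic, w_pix_to_word, w_word_to_pix


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.standard_normal((6, 4))      # 6 visual positions, 4 channels
    linguistic = rng.standard_normal((3, 4))  # 3 words, 4 channels
    v_out, l_out, _, _ = mutual_aware_fusion(visual, linguistic)
    print(v_out.shape, l_out.shape)
```

Both branches read the same inputs and do not depend on each other's output, which is what makes them parallel; a real model would add learned projections, normalization, and a decoder that consumes the two outputs.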