We address referring image segmentation, which aims to generate a mask for the object specified by a natural language expression. Many recent works use Transformers to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in the Transformer uses the language input only to compute attention weights and does not explicitly fuse language features into its output. Its output feature is therefore dominated by vision information, which limits the model's ability to comprehensively understand the multi-modal information and introduces uncertainty when the subsequent mask decoder extracts the output mask. To address this issue, we propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and a Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities. Based on $\mathrm{M^3Dec}$, we further propose Iterative Multi-modal Interaction ($\mathrm{IMI}$) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction ($\mathrm{LFR}$) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that the proposed approach significantly improves over the baseline and consistently outperforms state-of-the-art referring image segmentation methods on the RefCOCO series of datasets.
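To make the limitation of generic attention concrete, the following is a minimal sketch of standard scaled dot-product cross-attention, assuming language tokens act as queries and visual regions supply the keys and values. It is not the proposed $\mathrm{M^3Att}$; the function name `cross_attention` and the toy vectors are illustrative assumptions. The point is that the language input only shapes the attention weights, so each output is a convex combination of vision values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(lang_queries, vis_keys, vis_values):
    """Generic cross-attention (hypothetical helper, not the paper's M3Att).

    lang_queries: one query vector per language token
    vis_keys, vis_values: one key and one value vector per visual region
    Returns (outputs, weights), one row per language token.
    """
    d = len(vis_keys[0])
    outputs, all_weights = [], []
    for q in lang_queries:
        # Language enters only here, through the attention scores.
        scores = [dot(q, k) / math.sqrt(d) for k in vis_keys]
        w = softmax(scores)
        # The output is a weighted sum of *vision* values only.
        out = [
            sum(w_i * v[j] for w_i, v in zip(w, vis_values))
            for j in range(len(vis_values[0]))
        ]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights


# Toy data: 2 language tokens, 3 visual regions (all values are made up).
lang_queries = [[1.0, 0.0], [0.0, 1.0]]
vis_keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# One-hot values make the mixture visible: each output equals its weights.
vis_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(lang_queries, vis_keys, vis_values)
for out in outputs:
    print([round(x, 3) for x in out])
```

Because the values here are one-hot, each output row simply reproduces its attention weights: the language query decides how the visual regions are mixed, but no language feature appears in the result itself.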