Mask Grounding for Referring Image Segmentation

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. Combining all these techniques, our approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
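
To make the idea of predicting masked textual tokens more concrete, the following is a minimal, illustrative sketch of the text side of such an auxiliary task: randomly masking tokens of a referring expression and scoring predictions only at the masked positions. It is not the released implementation. The masking probability, the `[MASK]` placeholder, the helper names (`mask_tokens`, `masked_token_loss`) and the use of mean cross-entropy are all assumptions made for illustration, since the abstract does not specify them. In an actual model the logits would come from a predictor that also attends to visual features; here they are stand-in values.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with MASK.

    Returns the masked token list and a dict {position: original token}.
    mask_prob is an assumed value; the abstract does not state one.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    if not targets:
        # Guarantee at least one masked position so the loss is defined.
        i = rng.randrange(len(tokens))
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_token_loss(logits_per_position, targets, vocab):
    """Mean cross-entropy computed over the masked positions only."""
    index = {word: i for i, word in enumerate(vocab)}
    total = 0.0
    for pos, word in targets.items():
        probs = softmax(logits_per_position[pos])
        total += -math.log(probs[index[word]])
    return total / len(targets)


# Usage with a toy expression and uninformative (uniform) stand-in logits.
tokens = "the man in the red shirt".split()
vocab = sorted(set(tokens))
masked, targets = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
logits = [[0.0] * len(vocab) for _ in tokens]
loss = masked_token_loss(logits, targets, vocab)
```

With uniform logits the loss equals the logarithm of the vocabulary size, which is the baseline an uninformed predictor would obtain. A predictor that can consult the referred object's visual features should be able to push the loss below that baseline, which is the intuition behind using masked-token prediction as a grounding signal.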
