Referring image segmentation aims to segment the image region described by a given language expression, which is a typical multi-modal task. One of the critical challenges of this task is aligning the semantic representations of the visual and linguistic modalities. To achieve this, previous methods perform cross-modal interactions to update visual features but ignore the role of integrating fine-grained visual features into linguistic features. We present AlignFormer, an end-to-end framework for referring image segmentation. AlignFormer treats the linguistic feature as a center embedding and segments the region of interest by grouping pixels around that center embedding. To achieve pixel-text alignment, we design a Vision-Language Bidirectional Attention (VLBA) module and resort to contrastive learning. Concretely, the VLBA enhances visual features by propagating semantic text representations to each pixel and enhances linguistic features by fusing in fine-grained image features. Moreover, we introduce a cross-modal instance contrastive loss to alleviate the influence of pixel samples in ambiguous regions and to improve the ability to align multi-modal representations. Extensive experiments demonstrate that AlignFormer achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg by large margins.
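The two ideas in the abstract, bidirectional vision-language attention and pixel grouping around a linguistic center embedding, can be illustrated with a minimal pure-Python sketch. This is not the AlignFormer implementation: the function names, the single-head attention without learned projections, the mean pooling used to form the center embedding, and the sigmoid-based pixel-text loss are all assumptions made for illustration. In particular, the loss below is a plain per-pixel binary objective and does not reproduce the handling of ambiguous regions in the cross-modal instance contrastive loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention with a residual connection.
    Assumption: context vectors serve as both keys and values."""
    dim = len(context[0])
    scale = math.sqrt(dim)
    updated = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        fused = [sum(w * c[d] for w, c in zip(weights, context))
                 for d in range(dim)]
        updated.append([qi + fi for qi, fi in zip(q, fused)])
    return updated

def bidirectional_attention(pixels, words):
    """Hypothetical stand-in for VLBA: each modality attends to the other."""
    new_pixels = attend(pixels, words)  # text semantics propagated to pixels
    new_words = attend(words, pixels)   # fine-grained pixels fused into text
    return new_pixels, new_words

def center_embedding(words):
    """Assumption: mean-pool word features into one center embedding."""
    n = len(words)
    return [sum(w[d] for w in words) / n for d in range(len(words[0]))]

def group_pixels(pixels, center):
    """Assign a pixel to the referred region if it aligns with the center."""
    return [1 if dot(p, center) > 0.0 else 0 for p in pixels]

def softplus(x):
    # Stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pixel_text_loss(pixels, center, mask):
    """Illustrative pixel-text alignment loss: pull foreground pixels toward
    the center embedding and push background pixels away from it."""
    total = 0.0
    for p, y in zip(pixels, mask):
        s = dot(p, center)
        total += softplus(-s) if y == 1 else softplus(s)
    return total / len(pixels)

# Toy example: three "pixels" and two "words" with 2-d features.
pixels = [[2.0, 0.0], [-2.0, 0.0], [3.0, 0.0]]
words = [[1.0, 0.0], [1.0, 0.0]]
pixels, words = bidirectional_attention(pixels, words)
center = center_embedding(words)
mask = group_pixels(pixels, center)
loss = pixel_text_loss(pixels, center, mask)
```

In this toy setting the segmentation mask is simply the set of pixels whose updated features have a positive inner product with the center embedding, which is the sense in which the linguistic feature acts as a grouping center.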