Given an image and a natural language expression as input, the goal of referring image segmentation is to segment the foreground masks of the entities referred to by the expression. Existing methods mainly focus on interactive learning between vision and language to enhance the multi-modal representations used for global context reasoning. However, predicting directly in pixel-level space can lead to collapsed positioning and poor segmentation results. The main challenge lies in how to explicitly model entity localization, especially for non-salient entities. In this paper, we tackle this problem with a Collaborative Position Reasoning Network (CPRN) built on two proposed modules: the Row-and-Column interactive (RoCo) module and the Guided Holistic interactive (Holi) module. Specifically, RoCo aggregates the visual features into row-wise and column-wise features, corresponding to the two directional axes. It offers fine-grained matching that perceives the associations between the linguistic features and the two decoupled visual features, performing position reasoning over a hierarchical space. Holi integrates the features of the two modalities through a cross-modal attention mechanism, which suppresses irrelevant redundancy under the guidance of positioning information from RoCo. By incorporating the RoCo and Holi modules, CPRN captures the visual details needed for position reasoning, so that the model can achieve more accurate segmentation. To our knowledge, this is the first work that explicitly focuses on position reasoning modeling. We validate the proposed method on three evaluation datasets, where it consistently outperforms existing state-of-the-art methods.
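The row- and column-wise aggregation attributed to RoCo can be illustrated with a toy sketch. The abstract does not specify the aggregation operator, the matching function, or how the two axes are recombined, so everything below is an assumption made for illustration: mean pooling along each axis, a dot product against a single sentence vector, a softmax per axis, and an outer product to form a 2D position prior. The function name `row_column_position_prior` and its helpers are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(channel) / count for channel in zip(*vectors)]


def row_column_position_prior(feature_map, sentence_vec):
    """Toy row/column matching (illustrative assumption, not the paper's RoCo).

    feature_map: nested lists of shape H x W x C (visual features).
    sentence_vec: list of length C (a single pooled linguistic feature).
    Returns (row_attn, col_attn, prior), where prior is an H x W grid.
    """
    height = len(feature_map)
    width = len(feature_map[0])

    # Row-wise features: pool each row over the width axis -> H vectors.
    row_feats = [mean_vectors(feature_map[i]) for i in range(height)]
    # Column-wise features: pool each column over the height axis -> W vectors.
    col_feats = [
        mean_vectors([feature_map[i][j] for i in range(height)])
        for j in range(width)
    ]

    # Match each decoupled visual feature against the sentence vector.
    row_attn = softmax([dot(r, sentence_vec) for r in row_feats])
    col_attn = softmax([dot(c, sentence_vec) for c in col_feats])

    # Recombine the two axes into a 2D position prior (outer product).
    prior = [[ra * ca for ca in col_attn] for ra in row_attn]
    return row_attn, col_attn, prior


if __name__ == "__main__":
    # A 3 x 4 map with 2 channels; only cell (1, 2) responds to the sentence.
    fmap = [[[0.0, 0.0] for _ in range(4)] for _ in range(3)]
    fmap[1][2] = [4.0, 0.0]
    rows, cols, prior = row_column_position_prior(fmap, [1.0, 0.0])
    print("most likely row:", rows.index(max(rows)))
    print("most likely column:", cols.index(max(cols)))
```

In this sketch the position prior could then be used to re-weight a cross-modal attention map, which is one plausible reading of how positioning information might guide a holistic fusion step such as Holi; the abstract itself does not give that level of detail.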