Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by incorporating visual information into language tokens. To exploit visual contexts more effectively for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which effectively provides structural and semantic information about the visual target to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in the attention mechanism of referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens to suppress noise and share informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture the semantic visual contexts of informative regions. In this way, our framework enables the network's attention to align robustly with fine-grained regions of interest. Extensive experiments and visual analyses demonstrate the effectiveness of our approach: VIPA outperforms existing state-of-the-art methods on four public RIS benchmarks.