Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify the referred ground objects and assign pixel-wise labels within the imagery. One of the key challenges of this task is capturing discriminative multi-modal features via text-image alignment. However, existing RRSIS methods rely on a vanilla, coarse alignment, in which features of the language expression are directly extracted and fused with the visual features. In this paper, we argue that a ``fine-grained image-text alignment'' can improve the extraction of multi-modal information. To this end, we propose a new referring remote sensing image segmentation method that fully exploits the visual and linguistic representations. Specifically, the original referring expression is regarded as context text and is further decoupled into ground-object and spatial-position texts. The proposed Fine-grained Image-text Alignment Module (FIAM) simultaneously leverages the features of the input image and the corresponding texts, yielding more discriminative multi-modal representations. Meanwhile, to handle the varying scales of ground objects in remote sensing imagery, we introduce a Text-aware Multi-scale Enhancement Module (TMEM) that adaptively performs cross-scale fusion and interaction. We evaluate the effectiveness of the proposed method on two public referring remote sensing datasets, RefSegRS and RRSIS-D, where it achieves superior performance over several state-of-the-art methods. The code will be publicly available at https://github.com/Shaosifan/FIANet.
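To illustrate the decoupled alignment described above, the following PyTorch sketch shows one plausible way to align flattened image tokens with the context, object, and position texts through separate cross-attention branches. It is a minimal sketch under assumed design choices, not the paper's FIANet implementation: the module name, feature dimensions, the use of `nn.MultiheadAttention`, and the single-layer fusion are all illustrative assumptions.

```python
# Hypothetical sketch of fine-grained image-text alignment with decoupled
# context / object / position texts. Not the authors' FIANet code.
import torch
import torch.nn as nn


class FineGrainedAlignment(nn.Module):
    """Aligns image tokens with three text granularities via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # One cross-attention branch per text granularity (assumed design).
        self.attn_context = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_object = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_position = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project the concatenated aligned streams back to the token dimension.
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, img, ctx, obj, pos):
        # img: (B, N, C) flattened visual tokens; ctx/obj/pos: (B, L, C) text embeddings.
        a, _ = self.attn_context(img, ctx, ctx)    # image queries attend to context text
        b, _ = self.attn_object(img, obj, obj)     # ... to ground-object text
        c, _ = self.attn_position(img, pos, pos)   # ... to spatial-position text
        return self.fuse(torch.cat([a, b, c], dim=-1))


if __name__ == "__main__":
    B, N, L, C = 2, 64, 10, 256
    align = FineGrainedAlignment(C)
    out = align(torch.randn(B, N, C), torch.randn(B, L, C),
                torch.randn(B, L, C), torch.randn(B, L, C))
    print(out.shape)  # torch.Size([2, 64, 256])
```

In this reading, each text granularity contributes its own attention map over the visual tokens before fusion, which is one way the decoupling could yield more discriminative multi-modal features than a single coarse alignment; the actual FIAM and TMEM designs should be taken from the paper and the released code.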