Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although the Segment Anything Model 2 (SAM2) has shown remarkable performance on various segmentation tasks, applying it to RRSIS presents several challenges, including interpreting text-described RS scenes and generating effective prompts from text. To address these issues, we propose \textbf{RS2-SAM2}, a novel framework that adapts SAM2 to RRSIS by aligning adapted RS features with textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and to align the adapted visual features with visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator that takes the visual embeddings and class tokens as input and produces a pseudo-mask serving as a dense prompt for SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.