The Referring Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for objects specified by textual descriptions in remote sensing images, and has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads, but face challenges such as dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named \textit{prompt-generated semantic localization guiding Segment Anything Model} (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), which is enhanced by a clustering-based foreground point generator and an iterative mask-boundary optimization strategy, to produce a precise segmentation. Notably, the second stage can be training-free, significantly reducing the annotation burden of the RRSIS task. Moreover, decomposing the RRSIS task into two stages allows the model to focus on segmenting a specific region, avoiding interference from complex scenes. We further contribute a high-quality, multi-category, manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.
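As a reading aid, the minimal sketch below illustrates the two-stage inference flow described above. It assumes a hypothetical \texttt{ground\_text} interface for the visual grounding network and uses the public \texttt{segment\_anything} predictor API; the clustering-based point generator and the iterative boundary refinement are deliberately simplified stand-ins (the actual method may, e.g., cluster visual features rather than raw pixel coordinates).

\begin{verbatim}
# Illustrative sketch of the PSLG-SAM two-stage inference flow (not the released code).
import numpy as np
from sklearn.cluster import KMeans
from segment_anything import sam_model_registry, SamPredictor

def segment_referred_object(image, text, grounding_model, sam_checkpoint):
    # Stage 1: coarse localization -- a visual grounding network predicts a box
    # for the text-described object (hypothetical interface).
    box = grounding_model.ground_text(image, text)  # [x1, y1, x2, y2]

    # Stage 2: fine segmentation with SAM, prompted by the stage-1 box.
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    # Clustering-based foreground point generation (simplified): cluster pixel
    # coordinates inside the box and use cluster centers as positive point prompts.
    x1, y1, x2, y2 = map(int, box)
    ys, xs = np.mgrid[y1:y2, x1:x2]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    centers = KMeans(n_clusters=3, n_init=10).fit(coords).cluster_centers_

    masks, scores, _ = predictor.predict(
        point_coords=centers,
        point_labels=np.ones(len(centers)),
        box=np.array(box),
        multimask_output=False,
    )
    # Iterative mask-boundary optimization would re-prompt SAM with points sampled
    # near the current mask boundary; omitted here for brevity.
    return masks[0]
\end{verbatim}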