Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (\textbf{OVRSISBench}) built on widely used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when they are applied directly to remote sensing scenarios. Building on these insights, we propose \textbf{RSKT-Seg}, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer that jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by 3.8 mIoU and 5.9 mACC, while achieving 2$\times$ faster inference through efficient aggregation. Our code is available \href{https://github.com/LiBingyu01/RSKT-Seg}{\textcolor{blue}{here}}.
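To make the idea of a multi-directional vision-language cost map concrete, the following is a minimal NumPy sketch, not the RS-CMA implementation. It assumes that "multiple directions" means the four 90-degree rotations of the input, that per-pixel visual embeddings and per-class text embeddings share a dimension D, and that the per-direction cost maps are simply averaged after being rotated back (the actual module may use other directions or a learned aggregation). The names `cosine_cost_map`, `multi_directional_cost_map`, and `toy_encode` are hypothetical; `toy_encode` is a random stand-in for a real vision encoder.

```python
import numpy as np


def cosine_cost_map(feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pixel embedding and every class embedding.

    feats: (H, W, D) per-pixel visual embeddings.
    text_emb: (K, D) one embedding per class name.
    Returns an (H, W, K) cost map.
    """
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + 1e-8)
    return np.einsum("hwd,kd->hwk", f, t)


def multi_directional_cost_map(image, text_emb, encode, num_rotations=4):
    """Hypothetical multi-direction aggregation: encode the image at several
    90-degree rotations, compute a cost map for each, rotate each map back onto
    the original pixel grid, and average them."""
    maps = []
    for k in range(num_rotations):
        rotated = np.rot90(image, k=k, axes=(0, 1))
        cost = cosine_cost_map(encode(rotated), text_emb)
        # Undo the rotation so all cost maps share the original (H, W) layout.
        maps.append(np.rot90(cost, k=-k, axes=(0, 1)))
    return np.mean(maps, axis=0)


# Toy stand-in for a vision encoder (hypothetical). The row-dependent bias makes
# it deliberately NOT rotation-equivariant, like a real backbone.
rng = np.random.default_rng(0)
C, D, K = 3, 8, 5
proj = rng.standard_normal((C, D))


def toy_encode(img: np.ndarray) -> np.ndarray:
    rows = np.linspace(0.0, 1.0, img.shape[0])[:, None, None]
    return img @ proj + rows


image = rng.standard_normal((4, 6, C))
text_emb = rng.standard_normal((K, D))
cost = multi_directional_cost_map(image, text_emb, toy_encode)
print(cost.shape)  # (4, 6, 5)
```

Because the four 90-degree rotations form a closed group, rotating the input by 90 degrees only permutes the terms in the average, so `multi_directional_cost_map(np.rot90(image), ...)` equals `np.rot90(cost)` even though the encoder itself is not rotation-aware. Each pixel's class scores therefore no longer depend on the image orientation (for 90-degree steps), which is one simple way to obtain the kind of rotation-invariant cue the abstract describes; with a single direction this guarantee generally does not hold.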

翻译：开放词汇遥感图像分割（OVRSIS）是一项将开放词汇分割（OVS）应用于遥感（RS）领域的新兴任务，由于缺乏统一的评估基准以及自然图像与遥感图像之间的领域差异，该任务尚未得到充分探索。为弥合这些差距，我们首先基于广泛使用的遥感分割数据集建立了一个标准化的OVRSIS基准（\\textbf{OVRSISBench}），以实现方法间的一致评估。利用该基准，我们全面评估了几种具有代表性的OVS/OVRSIS模型，并揭示了它们直接应用于遥感场景时的局限性。基于这些发现，我们提出了\\textbf{RSKT-Seg}，一种专为遥感设计的创新开放词汇分割框架。RSKT-Seg整合了三个关键组件：（1）多方向成本图聚合（RS-CMA）模块，通过计算多个方向上的视觉-语言余弦相似性来捕获旋转不变的视觉线索；（2）高效成本图融合（RS-Fusion）Transformer，通过轻量级降维策略联合建模空间和语义依赖关系；（3）遥感知识迁移（RS-Transfer）模块，通过增强的上采样注入预训练知识并促进领域适应。在基准上的大量实验表明，RSKT-Seg始终优于强大的OVS基线，mIoU提升+3.8，mACC提升+5.9，同时通过高效聚合实现了2倍的推理加速。我们的代码位于\\href{https://github.com/LiBingyu01/RSKT-Seg}{\\textcolor{blue}{此处}}。