ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation

Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closed-set predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Classifying aerial categories is challenging because of their high inter-class similarity and intra-class variance. In addition, the domain gap between the pretraining datasets of vision-language models and remote sensing datasets hinders the zero-shot capabilities of a pretrained model when it is applied directly to remote sensing images. To address these challenges, we propose a $\textbf{Z}$ero-Sh$\textbf{o}$t $\textbf{R}$emote Sensing $\textbf{I}$nstance Segmentation framework, dubbed $\textbf{ZoRI}$. Our approach features a discrimination-enhanced classifier that uses refined textual embeddings to increase awareness of class disparities. Instead of direct fine-tuning, we propose a knowledge-maintained adaptation strategy that decouples semantic-related information to preserve the pretrained vision-language alignment while adjusting features to capture remote sensing domain-specific visual cues. Additionally, we introduce a prior-injected prediction with a cache bank of aerial visual prototypes, which supplements the semantic richness of text embeddings and seamlessly integrates aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments demonstrate that ZoRI achieves state-of-the-art performance on the zero-shot remote sensing instance segmentation task. Our code is available at https://github.com/HuangShiqi128/ZoRI.
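
The idea of supplementing text embeddings with a cache bank of visual prototypes can be illustrated with a small sketch. This is not the ZoRI implementation (see the linked repository for that). It is a generic illustration that assumes one prototype vector per class, cosine-similarity scoring, and a simple convex blend of the two score distributions. The function names and the `alpha` and `temperature` parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def blended_class_probs(region_emb, text_emb, prototypes, alpha=0.5, temperature=0.01):
    """Blend text-based and prototype-based class probabilities.

    region_emb: (N, D) embeddings of N predicted mask regions.
    text_emb:   (C, D) text embeddings, one per class name.
    prototypes: (C, D) visual prototypes, one per class (assumed layout of the cache bank).
    alpha:      hypothetical blend weight; 0 uses text only, 1 uses prototypes only.
    """
    r = l2_normalize(region_emb)
    text_logits = r @ l2_normalize(text_emb).T / temperature
    proto_logits = r @ l2_normalize(prototypes).T / temperature
    return (1.0 - alpha) * softmax(text_logits) + alpha * softmax(proto_logits)
```

Each row of the result is a probability distribution over the C classes, so the predicted class for a region is the row-wise `argmax`. Blending probabilities rather than raw logits is a design choice made here only to keep the two sources on the same scale.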
