Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as a mask selector over SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) performance in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM, using either GPT-5 in a zero-shot setting or a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, spanning open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.
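The contrastive pipeline described above can be caricatured as: SAM's grid prompting proposes candidate masks, CLIP embeds a crop of each masked region together with the text query, and the highest-similarity mask is kept. A minimal sketch of the selection step, assuming the CLIP embeddings have already been extracted upstream (function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np

def select_mask(mask_embeds: np.ndarray, text_embed: np.ndarray):
    """Pick the SAM mask whose CLIP embedding best matches the text prompt.

    mask_embeds: (num_masks, dim) CLIP image embeddings, one per masked crop.
    text_embed:  (dim,) CLIP text embedding of the category prompt.
    Returns the index of the best candidate and all cosine-similarity scores.
    """
    # L2-normalise so the dot product is cosine similarity, as in CLIP scoring.
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    scores = m @ t  # one similarity score per candidate mask
    return int(np.argmax(scores)), scores

# Toy example: three orthogonal "embeddings"; the text aligns with candidate 2.
best, scores = select_mask(np.eye(3), np.array([0.0, 0.0, 1.0]))
print(best)  # 2
```

In practice the mask embeddings would come from a CLIP image encoder applied to each masked crop and the text embedding from the matching text encoder; the sketch only shows how the selection itself reduces to an argmax over cosine similarities.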