Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.
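The Overview-Focus-Evolve flow can be illustrated with a minimal sketch, assuming the VLM cross-attention map and the DM attention map have already been extracted and resized to the same 2-D grid. The abstract does not specify how the maps are combined or how the attention evolution module works, so everything below is a hypothetical stand-in rather than the authors' implementation: the element-wise product used for Focus, the repeated squaring used for Evolve, and the names `overview_focus_evolve`, `mask_to_box`, `steps`, and `threshold`.

```python
import numpy as np


def normalize(attn):
    """Rescale a 2-D attention map to the range [0, 1]."""
    attn = attn.astype(np.float64)
    span = attn.max() - attn.min()
    if span == 0:
        return np.zeros_like(attn)
    return (attn - attn.min()) / span


def overview_focus_evolve(vlm_attn, dm_attn, steps=2, threshold=0.5):
    """Hypothetical sketch of the three stages on precomputed attention maps.

    vlm_attn: 2-D array, text-to-image attention from a VLM (assumed given).
    dm_attn:  2-D array, attention from a diffusion model (assumed given).
    Returns a boolean object mask.
    """
    # Overview: coarse semantic localization from the VLM attention.
    overview = normalize(vlm_attn)

    # Focus: modulate the coarse map with the DM map, which is assumed to
    # carry sharper structure. The element-wise product is an assumption.
    focus = normalize(overview * normalize(dm_attn))

    # Evolve: suppress weak activations. Repeated squaring is only a
    # stand-in for the attention evolution module named in the abstract.
    evolved = focus
    for _ in range(steps):
        evolved = normalize(evolved * evolved)

    return evolved >= threshold


def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

In this sketch, a region that the VLM map highlights but the DM map only weakly supports fades over the squaring steps, so the final threshold keeps just the area on which both maps agree. A real pipeline would replace both inputs with attention maps read out of the frozen models.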
