Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for the spatial reasoning of Multimodal Large Language Models (MLLMs). To exploit this characteristic, we propose a reasoning-guided, position-aware post-training framework, dubbed \textbf{RSGround-R1}, that progressively enhances spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented with our newly designed positional reward, which provides continuous, distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial-consistency-guided optimization scheme that dynamically adjusts policy updates according to their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate the superior performance and generalization of our model.
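The positional reward is described above only qualitatively (continuous and distance-aware). As a minimal illustrative sketch, and not the paper's actual formulation, such a reward can be built from the normalized distance between predicted and ground-truth box centers, decaying smoothly rather than jumping at an IoU threshold; the function name and the `sigma` scale below are hypothetical:

```python
import math

def positional_reward(pred_box, gt_box, sigma=0.1):
    """Illustrative continuous, distance-aware reward (not the paper's exact form).

    Boxes are (x1, y1, x2, y2) in [0, 1] normalized image coordinates.
    The reward is 1.0 when the predicted center coincides with the
    ground-truth center and decays smoothly toward 0 with distance.
    """
    cx_p, cy_p = (pred_box[0] + pred_box[2]) / 2, (pred_box[1] + pred_box[3]) / 2
    cx_g, cy_g = (gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2
    d = math.hypot(cx_p - cx_g, cy_p - cy_g)  # Euclidean center distance
    return math.exp(-d / sigma)               # continuous, monotonically decreasing in d

gt = (0.40, 0.40, 0.60, 0.60)
print(positional_reward(gt, gt))                    # exact match -> 1.0
print(positional_reward((0.0, 0.0, 0.2, 0.2), gt))  # distant prediction -> near 0
```

Unlike a binary IoU-thresholded reward, this shape gives the policy a nonzero gradient signal even when the predicted box does not yet overlap the target, which is the "continuous guidance" property the abstract refers to.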