Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) of multimodal large language models achieves strong performance when massive labeled datasets are available, it struggles in data-scarce scenarios and generalizes poorly. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then to use these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at: https://github.com/Geo-R1/geo-r1.