The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.