Image geolocation aims to infer where an image was captured from its visual content. Fundamentally, this is a reasoning process composed of \textit{hypothesis-verification cycles}, which requires models both to reason geospatially and to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, they are prone to factual hallucinations and generalize poorly in open-world settings or scenarios that require dynamic knowledge. To address these challenges, we propose LocationAgent, a hierarchical localization agent. Our core idea is to retain the hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER (Reasoner-Executor-Recorder) architecture, which uses role separation and context compression to prevent drift in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark covering multiple scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent outperforms existing methods by at least 30\% in zero-shot settings.