Geolocation, the task of identifying where an image was taken, requires complex reasoning and is crucial for navigation, monitoring, and cultural preservation. However, current methods often produce coarse, imprecise, and non-interpretable predictions. A major obstacle is the quality and scale of existing geolocation datasets, which are typically small and automatically constructed. This leads to noisy data and inconsistent task difficulty: images either reveal the answer too easily or lack sufficient clues for reliable inference. To address these challenges, we introduce a comprehensive geolocation framework with three components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. At the core of this framework is GeoComp (Geolocation Competition Dataset), collected from a geolocation game platform involving 740K users over two years. It comprises 25 million metadata entries and 3 million geo-tagged locations spanning much of the globe, with each location annotated thousands to tens of thousands of times by human users. The dataset offers diverse difficulty levels for detailed analysis and highlights key gaps in current models. Building on this dataset, we propose Geographical Chain-of-Thought (GeoCoT), a multi-step reasoning framework designed to enhance the reasoning capabilities of Large Vision Models (LVMs) in geolocation tasks. GeoCoT integrates contextual and spatial cues through a multi-step process that mimics human geolocation reasoning. Finally, using the GeoEval metric, we demonstrate that GeoCoT improves geolocation accuracy by up to 25% while enhancing interpretability.
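The abstract does not specify how geolocation accuracy is computed, and the sketch below is not GeoEval. It only illustrates one common convention in the geolocation literature: a prediction counts as correct if its great-circle (haversine) distance to the ground-truth coordinates falls within a chosen threshold. The function names, the threshold values, and the sample coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(preds, truths, threshold_km):
    """Fraction of predictions whose distance to the ground truth is <= threshold_km."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions and ground truths as (latitude, longitude) pairs.
    preds = [(48.8566, 2.3522), (0.0, 0.0)]     # Paris, origin
    truths = [(51.5074, -0.1278), (0.0, 0.0)]   # London, origin
    # Illustrative thresholds only (roughly city and country scale).
    for km in (25, 750):
        print(km, accuracy_within(preds, truths, km))
```

Reporting accuracy at several thresholds, rather than a single mean error, keeps a few very distant guesses from dominating the score, which is one reason this convention is widely used.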