Geo-localization aims to infer the geographic location where an image was captured from observable visual evidence. Traditional methods achieve impressive results through large-scale training on massive image corpora. With the emergence of multi-modal large language models (MLLMs), recent studies have explored their application to geo-localization, benefiting from gains in both accuracy and interpretability. However, existing benchmarks largely ignore the temporal information inherent in images, which can further constrain location inference. To bridge this gap, we introduce GTPred, a novel benchmark for geo-temporal prediction. GTPred comprises 370 globally distributed images spanning more than 120 years. We evaluate MLLM predictions by jointly scoring the predicted year and a hierarchical location sequence match, and we further assess intermediate reasoning chains against meticulously annotated ground-truth reasoning processes. Experiments on 8 proprietary and 7 open-source MLLMs show that, despite strong visual perception, current models remain limited in world knowledge and geo-temporal reasoning. The results also demonstrate that incorporating temporal information significantly improves location inference.
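To make the joint evaluation concrete, the sketch below shows one plausible way such a score could be computed: prefix matching over a coarse-to-fine location hierarchy combined with a year term that decays with absolute error. This is a minimal, hypothetical illustration; the function names, weights, and tolerance are assumptions, not the metric actually used by GTPred.

```python
# Hypothetical sketch of a joint geo-temporal score (NOT the GTPred metric).
# Assumes locations are coarse-to-fine sequences, e.g. [continent, country, city].

def location_score(pred: list[str], truth: list[str]) -> float:
    """Fraction of the ground-truth hierarchy matched as a prefix,
    e.g. ["Europe", "France", "Paris"] vs ["Europe", "France", "Lyon"] -> 2/3."""
    matched = 0
    for p, t in zip(pred, truth):
        if p.strip().lower() != t.strip().lower():
            break
        matched += 1
    return matched / len(truth) if truth else 0.0

def year_score(pred_year: int, true_year: int, tolerance: int = 10) -> float:
    """Linear falloff: full credit at the exact year, zero beyond `tolerance` years.
    The tolerance value here is an arbitrary illustrative choice."""
    return max(0.0, 1.0 - abs(pred_year - true_year) / tolerance)

def joint_score(pred_loc: list[str], true_loc: list[str],
                pred_year: int, true_year: int, w_loc: float = 0.5) -> float:
    """Convex combination of location and year scores (weighting is assumed)."""
    return w_loc * location_score(pred_loc, true_loc) + \
        (1 - w_loc) * year_score(pred_year, true_year)

if __name__ == "__main__":
    # Matches continent and country but not city; year off by 4:
    # 0.5 * (2/3) + 0.5 * 0.6 = 0.633...
    print(joint_score(["Europe", "France", "Paris"],
                      ["Europe", "France", "Lyon"],
                      pred_year=1948, true_year=1952))
```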