Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.
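As an illustration of the evaluation described above, the following minimal sketch scores prompt-based country prediction by exact match after light normalization. The prompt wording, the `predict` callable (a stand-in for any VLM call), and the `normalize` helper are assumptions made for this example; the prompts and scoring rules actually used in this work are not specified here.

```python
import re
from typing import Callable, Iterable, Tuple

# Hypothetical prompt wording, assumed for illustration only.
PROMPT = "In which country was this photo taken? Answer with the country name only."


def normalize(text: str) -> str:
    """Lowercase, replace non-letter characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def country_accuracy(
    samples: Iterable[Tuple[str, str]],
    predict: Callable[[str, str], str],
) -> float:
    """Fraction of (image_path, true_country) pairs predicted correctly.

    `predict(image_path, prompt)` is a placeholder for a zero-shot VLM query
    that returns the model's free-text answer.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    correct = sum(
        normalize(predict(image_path, PROMPT)) == normalize(true_country)
        for image_path, true_country in samples
    )
    return correct / len(samples)
```

Exact matching after normalization is deliberately strict: it does not reconcile aliases such as "USA" and "United States", which a fuller evaluation would likely need to map to a shared country code.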