Vision Language Models (VLMs) are good at recognizing where in the world a photograph was taken: their geolocation prediction accuracy rivals that of the best human experts. But many VLMs are startlingly bad at \textit{explaining} which image evidence led to their prediction, even when the predicted location is correct. In this paper, we introduce GeoRC, the first benchmark for geolocation reasoning chains sourced directly from Champion-tier GeoGuessr experts, including the reigning world champion. The benchmark consists of 800 ``ground truth'' reasoning chains across 500 query scenes from GeoGuessr maps, with expert chains addressing hundreds of different discriminative attributes, such as soil properties, architecture, and license plate shapes. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that a Qwen 3 LLM-as-a-judge correlates best with human-expert scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts at producing auditable reasoning chains. Small open-weight VLMs such as Llama and Qwen fail catastrophically on our benchmark: they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but \textit{no visual information at all}. We believe the gap between human experts and VLMs on this task points to limitations of VLMs in extracting fine-grained visual attributes from high-resolution images. We open-source our benchmark for the community to use.