Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at https://TimeSpot-GT.github.io.
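To make the structured prediction target concrete, the following is a minimal sketch of what one prediction record and a latitude-longitude error measure might look like. The attribute list mirrors the abstract, but the class name `GeoTemporalPrediction`, the field names and types, and the use of great-circle (haversine) distance are illustrative assumptions; the abstract does not specify the benchmark's actual schema or scoring metrics.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Mean Earth radius in km (an assumption of this sketch, not from the benchmark).
EARTH_RADIUS_KM = 6371.0


@dataclass
class GeoTemporalPrediction:
    """Hypothetical record holding the attributes listed in the abstract."""
    # Temporal attributes
    season: str
    month: str
    time_of_day: str
    daylight_phase: str
    # Geographic attributes
    continent: str
    country: str
    climate_zone: str
    environment_type: str
    latitude: float   # degrees, north positive
    longitude: float  # degrees, east positive


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def coordinate_error_km(pred: GeoTemporalPrediction,
                        truth: GeoTemporalPrediction) -> float:
    """Distance between predicted and ground-truth coordinates (assumed metric)."""
    return haversine_km(pred.latitude, pred.longitude,
                        truth.latitude, truth.longitude)
```

Under these assumptions, categorical fields such as `season` or `country` could be compared by exact match, while coordinates are compared by distance, so a near miss is penalized less than a prediction on the wrong continent.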