GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

Recently, the spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for autonomous driving, embodied AI, and general artificial intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with image/video context, or on geographic perspective reasoning with graphics context (e.g., a map). They therefore fail to assess VLMs' geographic spatial-temporal intelligence when both image/video and graphics context are present, which is important for areas like traffic management and emergency response. To address this gap, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning about moving targets in a large-scale camera network. GTR-Bench is more challenging because it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models in geo-temporal reasoning: (1) VLMs' reasoning is impaired by imbalanced utilization of spatial-temporal context. (2) VLMs are weak at temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack proficiency in comprehending map data or aligning it with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.

