UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA (UDVideoQA), a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to preserve privacy without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging about one question per second. Its taxonomy follows a hierarchy of reasoning levels, ranging from basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 state-of-the-art VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark (VideoQGen). Results reveal a persistent perception-reasoning gap: models that excel at abstract inference often fail at fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.
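As a quick sanity check on the annotation density quoted in the abstract, the short sketch below recomputes the questions-per-second rate from the two stated figures. It assumes that "28K" means exactly 28,000 pairs and that the full 8 hours are annotated; the variable names are illustrative and not part of the dataset tooling.

```python
# Figures quoted in the abstract; "28K" is taken as 28,000 (an assumption).
qa_pairs = 28_000
annotated_hours = 8

# Convert the annotated duration to seconds.
annotated_seconds = annotated_hours * 3600

# Average annotation density over the annotated footage.
questions_per_second = qa_pairs / annotated_seconds

print(f"{annotated_seconds} annotated seconds")
print(f"{questions_per_second:.2f} questions per second")
```

Under these assumptions the rate comes out slightly below one question per second, which is consistent with the abstract's rounded figure.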
