We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving the road event understanding capabilities of general-purpose Video LLMs.
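To make the semi-automatic annotation idea concrete, below is a minimal Python sketch of one plausible generate-then-verify loop: a Text LLM drafts a QA pair from a video's social comments, and a Video LLM checks the draft against the clip before the pair is kept. This is an illustrative assumption, not the paper's actual pipeline; the function names, prompts, stub model calls, and consistency check are all hypothetical.

```python
# Hypothetical sketch of a Text-LLM + Video-LLM QA-generation loop in the
# spirit of RoadSocial's framework. The real prompts, models, 12 task types,
# and filtering rules are defined in the paper, not here.

from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    task: str  # one of the QA task types, e.g. "description" or "why"


def text_llm(prompt: str) -> str:
    """Stub for a Text LLM call; swap in a real instruction-tuned model."""
    return "Q: What caused the incident?\nA: The vehicle ran a red light."


def video_llm(video_path: str, question: str) -> str:
    """Stub for a Video LLM call that answers a question about a clip."""
    return "The vehicle ran a red light."


def generate_qa(video_path: str, comments: list[str], task: str) -> QAPair | None:
    # 1) Draft a QA pair from the social-media discourse with a Text LLM.
    prompt = (
        "Given these viewer comments on a road-event video:\n"
        + "\n".join(f"- {c}" for c in comments)
        + f"\nWrite one {task} question and its answer grounded in the comments."
    )
    draft = text_llm(prompt)
    q, _, a = draft.partition("\nA: ")
    q = q.removeprefix("Q: ").strip()
    a = a.strip()

    # 2) Verify the draft against the video itself with a Video LLM; keep only
    #    pairs whose answer is consistent with what is visible in the clip.
    #    (A naive substring check stands in for a real consistency judge.)
    video_answer = video_llm(video_path, q)
    if a and a.lower() in video_answer.lower():
        return QAPair(question=q, answer=a, task=task)
    return None  # discard inconsistent drafts; human spot checks would follow


if __name__ == "__main__":
    pair = generate_qa("crash_clip.mp4", ["He ran the red!", "So dangerous"], "why")
    print(pair)
```

The generate-then-verify split is what makes such a scheme scalable: the Text LLM does the cheap creative work from comments, the Video LLM acts as an automatic filter against the footage, and human effort is reduced to spot-checking the surviving pairs.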