We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale dataset with 3 million Q&As built on WOMD, focusing on describing and reasoning about interactions and intentions in driving scenarios. Existing language datasets for driving primarily capture interactions caused by close physical proximity. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are not yet sufficiently covered. To address this, WOMD-Reasoning presents by far the largest multi-modal Q&A dataset on real-world driving scenarios, covering a wide range of driving topics, from map descriptions and motion-status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. We further introduce Motion-LLaVA, a motion-language model fine-tuned on the proposed dataset with robust interaction reasoning capabilities. We benchmark its performance across various configurations, including different input modalities, reasoning techniques, and network architectures. The robust, diverse, and multi-modal nature of WOMD-Reasoning highlights its potential to advance future autonomous driving research and enable a broad range of applications. The dataset and its vision-modality extension are available at https://waymo.com/open/download, and the code and prompts to build it are available at https://github.com/yhli123/WOMD-Reasoning.