Recent advances in multimodal large language models (MLLMs) have demonstrated strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible from a single front view alone and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge, and we distill it into two stages: first resolving the immediate risk, then mitigating the downstream risks induced by that decision. To support this task, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios in both multiple-choice and open-ended formats, across image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios relative to normal scenes, but fine-tuning on WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset for developing safer, more reasoning-capable driving agents. Our code and data are available at https://github.com/sjyu001/WaymoQA