Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, autonomous driving systems must detect and manage these scenarios effectively. However, the rarity and high-risk nature of such cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs), trained on extensive datasets, have shown remarkable zero-shot capabilities. This work explores the potential of VLMs for detecting hard cases in autonomous driving. We demonstrate that VLMs such as GPT-4V can detect hard cases in traffic-participant motion prediction at both the agent and scenario levels. We introduce a feasible pipeline in which a VLM, fed sequential image frames with carefully designed prompts, identifies challenging agents or scenarios; these detections are verified against existing prediction models. Moreover, by leveraging this VLM-based hard-case detection, we further improve the training efficiency of an existing motion prediction pipeline through data selection over the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline, which combines VLMs with state-of-the-art methods, on the nuScenes dataset. The code is available at https://github.com/KTH-RPL/Detect_VLM.
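The pipeline described above can be sketched minimally as follows. This is a hypothetical illustration, not the authors' implementation: the `Scenario` container, the `vlm_query` callable (standing in for a GPT-4V request with image frames and a prompt), and the HARD/EASY answer format are all assumptions made for clarity.

```python
# Hypothetical sketch of the abstract's pipeline: a VLM inspects sequential
# camera frames and flags hard scenarios; flagged samples are then selected
# for training. The VLM call is abstracted as a callable so the sketch is
# runnable without an API key.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scenario:
    scenario_id: str
    frames: List[str]  # paths to sequential camera images

# An assumed prompt format; the paper's actual prompts are designed by the authors.
PROMPT = ("You are shown consecutive driving frames. Answer 'HARD' if any "
          "traffic participant's motion is difficult to predict, otherwise 'EASY'.")

def detect_hard_cases(scenarios: List[Scenario],
                      vlm_query: Callable[[List[str], str], str]) -> List[str]:
    """Return ids of scenarios the VLM flags as hard cases."""
    hard_ids = []
    for s in scenarios:
        answer = vlm_query(s.frames, PROMPT)  # e.g. a GPT-4V request
        if "HARD" in answer.upper():
            hard_ids.append(s.scenario_id)
    return hard_ids

def select_training_samples(scenarios: List[Scenario],
                            hard_ids: List[str]) -> List[Scenario]:
    """Data selection: keep only the VLM-flagged scenarios for training."""
    keep = set(hard_ids)
    return [s for s in scenarios if s.scenario_id in keep]
```

In this sketch, training a motion predictor only on the returned subset realizes the data-selection step that the abstract credits with improved training efficiency.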