Predicting pedestrian behavior is key to ensuring the safety and reliability of autonomous vehicles. While deep learning methods that learn from annotated video frame sequences have shown promise, they often fail to fully capture the dynamic interactions between pedestrians and traffic, which are crucial for accurate predictions. These models also lack nuanced common-sense reasoning. Moreover, manual dataset annotation for these models is expensive and challenging to adapt to new situations. The advent of Vision Language Models (VLMs) offers promising ways to address these issues, thanks to their advanced visual and causal reasoning skills. To our knowledge, this research is the first to conduct both quantitative and qualitative evaluations of VLMs in the context of pedestrian behavior prediction for autonomous driving. We evaluate GPT-4V(ision) on two publicly available pedestrian datasets: JAAD and WiDEVIEW. Our quantitative analysis focuses on GPT-4V's ability to predict pedestrian behavior in current and future frames. The model achieves 57% accuracy in a zero-shot manner, which, while impressive, still trails state-of-the-art domain-specific models (70%) in predicting pedestrian crossing actions. Qualitatively, GPT-4V shows an impressive ability to process and interpret complex traffic scenarios, differentiate between various pedestrian behaviors, and detect and analyze groups. However, it still faces challenges, such as difficulty detecting smaller pedestrians and assessing the relative motion between pedestrians and the ego vehicle.