Existing pedestrian behavior prediction methods rely primarily on deep neural networks that use features extracted from video frame sequences. Although these vision-based models have shown promising results, they are limited in how well they capture and exploit the dynamic spatio-temporal interactions between the target pedestrian and the surrounding traffic elements, which are crucial for accurate reasoning. Additionally, training these models requires manually annotating domain-specific datasets, a process that is expensive and time-consuming and that is difficult to generalize to new environments and scenarios. The recent emergence of Large Multimodal Models (LMMs) offers potential solutions to these limitations, owing to their superior visual understanding and causal reasoning capabilities, which can be harnessed through semi-supervised training. GPT-4V(ision), the latest iteration of the state-of-the-art GPT family of large language models, now accepts visual input. This report provides a comprehensive evaluation of GPT-4V's potential for pedestrian behavior prediction in autonomous driving on three publicly available datasets: JAAD, PIE, and WiDEVIEW. Quantitative and qualitative evaluations demonstrate GPT-4V's promise in zero-shot pedestrian behavior prediction and driving scene understanding. However, it still falls short of state-of-the-art domain-specific models. Challenges include difficulty handling small pedestrians and vehicles in motion. These limitations highlight the need for further research and development in this area.