Multimodal Large Language Models (MLLMs) have demonstrated outstanding capabilities across many areas of multimodal reasoning. We therefore leverage the reasoning ability of MLLMs for environment description and scene understanding in complex traffic environments. In this paper, we propose AccidentBlip2, a multimodal large language model that predicts in real time whether an accident is about to occur. Our approach extracts features from the temporal sequence of six-view surround images with a vision transformer and performs temporal inference through a temporal Blip framework. The resulting temporal tokens are then fed into the MLLM to infer whether an accident will occur. Because AccidentBlip2 relies on neither BEV images nor LiDAR, it significantly reduces the number of inference parameters and the inference cost of the MLLM, and it incurs little additional overhead during training. AccidentBlip2 outperforms existing solutions on the DeepAccident dataset and can also serve as a reference solution for end-to-end autonomous-driving accident prediction.
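The pipeline above can be sketched in a few lines. This is a minimal illustration under loose assumptions, not the paper's actual architecture: the feature extractor, the recurrent fusion of temporal tokens, and the risk head are all hypothetical stand-ins (random vectors instead of a real ViT, a logistic score instead of an MLLM).

```python
import math
import random

random.seed(0)
DIM = 8      # feature dimension (hypothetical, small for illustration)
N_VIEWS = 6  # six surround-view cameras per timestep

def extract_view_features(image, dim=DIM):
    """Stand-in for ViT feature extraction of one camera view (random here)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def temporal_tokens(frames, dim=DIM):
    """Fuse each timestep's six-view features into a running temporal token."""
    token = [0.0] * dim
    out = []
    for views in frames:
        feats = [extract_view_features(v) for v in views]
        step = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        # simple recurrent fusion of the previous token with the new step
        token = [0.5 * t + 0.5 * s for t, s in zip(token, step)]
        out.append(list(token))
    return out

def accident_risk(tokens, weights=None):
    """Toy risk head standing in for the MLLM: logistic score on last token."""
    if weights is None:
        weights = [random.gauss(0.0, 1.0) for _ in range(len(tokens[-1]))]
    z = sum(t * w for t, w in zip(tokens[-1], weights))
    return 1.0 / (1.0 + math.exp(-z))

frames = [[None] * N_VIEWS for _ in range(4)]  # 4 timesteps of dummy "images"
tokens = temporal_tokens(frames)
risk = accident_risk(tokens)
```

The recurrent fusion step mirrors the idea of carrying temporal context forward so that the final token summarizes the whole scene sequence before the accident-risk decision is made.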