Intelligent vehicles have demonstrated excellent capabilities in many transportation scenarios, but the limited inference capability of camera-based neural networks constrains the accuracy of accident detection in complex transportation systems. This paper presents AccidentBlip2, a pure vision-based accident detection method built on the multi-modal large model Blip2. Our method first processes multi-view images through ViT-14g and feeds the multi-view features into the cross-attention layer of the Q-Former. Unlike Blip2's Q-Former, our Motion Q-Former extends the self-attention layer with a temporal-attention layer. During inference, the queries generated from previous frames are fed into the Motion Q-Former to aggregate temporal information. The queries are updated with an auto-regressive strategy and sent to an MLP that detects whether an accident has occurred in the surrounding environment. AccidentBlip2 can also be extended to a multi-vehicle cooperative system by deploying a Motion Q-Former on each vehicle and fusing the generated queries into the MLP for auto-regressive inference. Our approach outperforms existing video large language models in detection accuracy in both single-vehicle and multi-vehicle systems.
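The pipeline above can be sketched in a few lines: per-frame visual features enter the Motion Q-Former, queries self-attend, attend temporally to the previous frame's queries, cross-attend into the image features, and are carried forward auto-regressively before an MLP head scores the scene. This is a minimal NumPy sketch; all dimensions, the pooling step, and the module internals are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # query embedding dim (assumed)
N_QUERIES = 4   # number of learnable queries (assumed)
N_FRAMES = 3    # length of the toy video clip

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def motion_qformer_step(queries, prev_queries, frame_feats):
    # self-attention over the current queries
    q = attention(queries, queries, queries)
    # temporal attention: attend to the queries from the previous frame
    q = q + attention(q, prev_queries, prev_queries)
    # cross-attention into the multi-view features (stand-in for ViT-14g output)
    q = q + attention(q, frame_feats, frame_feats)
    return q

queries = rng.standard_normal((N_QUERIES, D))
prev = queries.copy()
for t in range(N_FRAMES):
    frame_feats = rng.standard_normal((8, D))   # per-frame visual tokens (toy)
    queries = motion_qformer_step(queries, prev, frame_feats)
    prev = queries  # auto-regressive update: current queries feed the next step

# MLP head (toy single layer): pooled queries -> accident logit
W = rng.standard_normal((D, 1))
logit = queries.mean(axis=0) @ W
```

The key design point the sketch mirrors is that temporal context enters only through the previous frame's queries, so memory cost stays constant regardless of clip length.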