Intelligent vehicles have demonstrated excellent capabilities in many transportation scenarios, but the complexity of on-board sensors and the inference capabilities of on-board neural networks limit the accuracy of accident detection in complex transportation systems. In this paper, we present AccidentBlip2, a purely vision-based accident detection method built on the multimodal large model Blip2. Our method first processes the multi-view images through ViT-14g and feeds the multi-view features into the cross-attention layer of the Qformer. Our self-designed Motion Qformer replaces the self-attention layer in Blip2's Qformer with a Temporal Attention layer. During inference, the query generated for the previous frame is fed into the Temporal Attention layer so that the model can reason over temporal information. We then detect whether there is an accident in the surrounding environment by performing autoregressive inference on the query and passing it to an MLP. We also extend our approach to a multi-vehicle cooperative system by deploying Motion Qformer on each vehicle and simultaneously feeding the queries generated on all vehicles into the MLP for autoregressive inference. Our approach surpasses existing video large language models in detection accuracy and also adapts to multi-vehicle systems, making it more applicable to intelligent transportation scenarios.
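The data flow described in the abstract can be illustrated with a small, untrained toy in plain Python. This is only a sketch under stated assumptions, not the authors' implementation: the attention uses identity projections instead of learned ones, random vectors stand in for ViT-14g features, the names `motion_qformer_step`, `run_vehicle`, `accident_probability`, and `detect` are hypothetical, and fusing vehicles by concatenating their queries before mean pooling is an assumed choice that the abstract does not specify.

```python
import math
import random

DIM, NUM_QUERIES = 8, 4  # hypothetical sizes, chosen only for this demo
rng = random.Random(0)   # fixed seed so the toy is reproducible

def rand_matrix(rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def attend(queries, memory):
    """Scaled dot-product attention with identity projections (a simplification)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * m[i] for w, m in zip(weights, memory))
                    for i in range(len(q))])
    return out

def residual(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

# Stand-in for the learned query tokens of the Qformer (random here, untrained).
LEARNED_QUERIES = rand_matrix(NUM_QUERIES, DIM, scale=1.0)

def motion_qformer_step(prev_queries, view_features):
    # Temporal attention: the query tokens attend to the previous frame's queries.
    h = residual(LEARNED_QUERIES, attend(LEARNED_QUERIES, prev_queries))
    # Cross attention: the result attends to the multi-view image features.
    return residual(h, attend(h, view_features))

def run_vehicle(frames):
    # Autoregressive loop: each frame's output queries feed the next frame.
    queries = LEARNED_QUERIES
    for view_features in frames:
        queries = motion_qformer_step(queries, view_features)
    return queries

# Untrained two-layer MLP head (random weights, zero biases).
W1, B1 = rand_matrix(DIM, DIM), [0.0] * DIM
W2, B2 = rand_matrix(1, DIM)[0], 0.0

def accident_probability(queries):
    pooled = [sum(col) / len(queries) for col in zip(*queries)]  # mean over queries
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(W1, B1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-logit))

def detect(vehicles):
    # Assumed fusion: concatenate the final queries of every vehicle, then classify.
    fused = [q for frames in vehicles for q in run_vehicle(frames)]
    return accident_probability(fused)

def fake_frame(num_tokens=6):
    # Random vectors standing in for ViT features of one multi-view frame.
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(num_tokens)]

single = [[fake_frame() for _ in range(3)]]                    # one vehicle, 3 frames
multi = [[fake_frame() for _ in range(3)] for _ in range(2)]   # two vehicles
print(detect(single), detect(multi))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how per-frame queries are carried forward in time and how the single-vehicle and multi-vehicle cases can share the same classification head.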