The ability to understand the surrounding scene is of paramount importance for Autonomous Vehicles (AVs). This paper presents a system that works online, responding immediately when anomalies arise around the AV while using only the videos captured by a dash-mounted camera. Our architecture, called MOVAD, relies on two main modules: a Short-Term Memory Module, implemented by a Video Swin Transformer (VST), which extracts information related to the ongoing action, and a Long-Term Memory Module, injected inside the classifier, which also considers remote past information and action context by means of a Long Short-Term Memory (LSTM) network. The strengths of MOVAD lie not only in its excellent performance but also in its straightforward, modular architecture, which is trained end-to-end on RGB frames alone with as few assumptions as possible, making it easy to implement and experiment with. We evaluated our method on the Detection of Traffic Anomaly (DoTA) dataset, a challenging collection of dash-mounted camera videos of accidents. After an extensive ablation study, MOVAD reaches an AUC score of 82.17%, surpassing the current state of the art by +2.87 AUC. Our code will be available at https://github.com/IMPLabUniPr/movad/tree/movad_vad
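The online, two-module design described above can be illustrated with a toy per-frame scoring loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `encode_clip` is a hypothetical stand-in for the short-term module (a VST in the paper), `RecurrentScorer` is a hypothetical stand-in for the LSTM-based classifier (here a single running-average state rather than a real LSTM), and the window length `WINDOW` is an arbitrary illustrative value.

```python
from collections import deque
import math

WINDOW = 4  # hypothetical short-term window length, in frames


def encode_clip(clip):
    """Stand-in for the short-term module: mean intensity over the clip.

    A real system would run a video backbone here; this toy version just
    averages the per-frame mean pixel value.
    """
    return sum(sum(frame) / len(frame) for frame in clip) / len(clip)


class RecurrentScorer:
    """Stand-in for the long-term module: one scalar state carried across frames."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.state = None  # long-term summary of past clip features

    def step(self, feature):
        if self.state is None:
            self.state = feature  # first clip defines the initial context
        deviation = abs(feature - self.state)
        # Update the long-term state after measuring the deviation.
        self.state = self.decay * self.state + (1 - self.decay) * feature
        # Map the deviation to a score in [0, 1).
        return 1.0 - math.exp(-deviation)


def score_stream(frames, window=WINDOW):
    """Emit one anomaly score per incoming frame, using only past frames."""
    buffer = deque(maxlen=window)  # short-term memory: the last `window` frames
    scorer = RecurrentScorer()     # long-term memory: state kept between steps
    scores = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) < window:
            scores.append(None)    # not enough context yet
            continue
        scores.append(scorer.step(encode_clip(buffer)))
    return scores


# Six dark frames followed by two bright ones (each frame is a flat pixel list).
frames = [[0.0] * 4 for _ in range(6)] + [[1.0] * 4 for _ in range(2)]
scores = score_stream(frames)
```

In this toy stream the score stays at zero while the scene is constant and rises once the bright frames enter the window. The point of the sketch is the data flow: a fixed-size buffer feeds a clip encoder, and a state that persists across steps supplies the longer context, so each score depends only on frames seen so far.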