The ability to understand the surrounding scene is of paramount importance for Autonomous Vehicles (AVs). This paper presents a system that works online with real-time response guarantees, reacting immediately when anomalies arise around the AV while using only the videos captured by a dash-mounted camera. Our architecture, called MOVAD, relies on two main modules: a short-term memory module that extracts information related to the ongoing action, implemented by a Video Swin Transformer adapted to work in an online scenario, and a long-term memory module that also takes the remote past into account by means of a Long Short-Term Memory (LSTM) network. We evaluated our method on the Detection of Traffic Anomaly (DoTA) dataset, a challenging collection of dash-mounted camera videos of accidents. After an extensive ablation study, MOVAD reaches an AUC score of 82.11%, surpassing the current state of the art by 2.81 AUC points. Our code will be available at https://github.com/IMPLabUniPr/movad/tree/icip
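To make the two-module idea concrete, the following is a minimal sketch of online, frame-by-frame scoring with a short-term window and a long-term running state. It is not the MOVAD implementation: the class name `OnlineAnomalyScorer`, the scalar "frames", the frame-difference feature (a stand-in for the Video Swin Transformer), and the exponential moving average (a stand-in for the LSTM) are all assumptions chosen so the example runs with the Python standard library alone.

```python
import math
from collections import deque


class OnlineAnomalyScorer:
    """Toy online scorer (hypothetical, not MOVAD itself).

    Short-term memory: a sliding window over the most recent frames.
    Long-term memory: a running state updated once per frame.
    """

    def __init__(self, window_size=4, decay=0.9):
        # Only the last `window_size` frames are kept, so memory use is bounded.
        self.window = deque(maxlen=window_size)
        self.state = 0.0  # long-term summary of past short-term features
        self.decay = decay

    def short_term_feature(self):
        # Stand-in for the short-term module: mean absolute change
        # between consecutive frames in the current window.
        frames = list(self.window)
        if len(frames) < 2:
            return 0.0
        diffs = [abs(b - a) for a, b in zip(frames, frames[1:])]
        return sum(diffs) / len(diffs)

    def step(self, frame):
        """Consume one frame and return an anomaly score in (0, 1)."""
        self.window.append(frame)
        feature = self.short_term_feature()
        # Stand-in for the long-term module: exponential moving average.
        self.state = self.decay * self.state + (1.0 - self.decay) * feature
        # Score how far the current feature exceeds the long-term summary.
        return 1.0 / (1.0 + math.exp(-(feature - self.state)))


# Each "frame" is a single number here (e.g. a mean brightness value).
stream = [0.5] * 10 + [0.9, 0.1, 0.8] + [0.5] * 5
scorer = OnlineAnomalyScorer()
scores = [scorer.step(frame) for frame in stream]
```

Each call to `step` uses only past and current frames, which is what makes the loop online: a score is available as soon as a frame arrives. In this toy stream the score stays at its neutral value while the input is constant and rises when the abrupt changes begin.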