We introduce MAVAD, the first audio-visual dataset for traffic anomaly detection taken from real-world scenes, covering a diverse range of weather and illumination conditions. In addition, we propose AVACA, a novel method that uses cross-attention to combine visual and audio features extracted from video sequences in order to detect anomalies. We demonstrate that adding audio improves the performance of AVACA by up to 5.2%. We also evaluate the impact of image anonymization, which causes only a minor decrease in performance, averaging 1.7%.
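To make the fusion idea concrete, the following is a minimal sketch of cross-attention between two modalities. It is not the AVACA implementation. It assumes that visual tokens act as queries and audio tokens act as keys and values, omits the learned projection matrices, and fuses by simple concatenation; the function names `cross_attention` and `fuse` and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Learned projections are omitted (an assumption made for brevity).
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(key dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        context = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(dim)]
        attended.append(context)
    return attended


def fuse(visual_tokens, audio_tokens):
    """Concatenate each visual token with its audio context.

    Hypothetical fusion step: visual tokens are the queries, audio tokens
    are the keys and values.
    """
    audio_context = cross_attention(visual_tokens, audio_tokens, audio_tokens)
    return [v + c for v, c in zip(visual_tokens, audio_context)]


# Toy features: 3 visual frames and 2 audio windows, each 4-dimensional.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused = fuse(visual, audio)
```

Each fused token keeps the original visual feature and appends an audio summary weighted by how strongly that frame matches each audio window. A downstream anomaly classifier would then operate on the fused tokens.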