We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach that weights tokens based on motion gradients, thus avoiding learning to reconstruct the static background scene. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs of the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model with jointly reconstructing the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by extensive experiments carried out on three benchmarks: Avenue, ShanghaiTech and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores while processing 1670 frames per second (FPS). Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design.
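To make the first two ideas more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify any formulas, so everything here is an assumption: the absolute difference between consecutive frames stands in for the motion gradients, tokens are taken to be non-overlapping square patches, and the function names (`motion_token_weights`, `weighted_reconstruction_loss`, `decoder_discrepancy`) are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def patch_means(img: Frame, patch: int) -> List[List[float]]:
    """Average img over non-overlapping patch x patch tiles (one value per token)."""
    rows, cols = len(img) // patch, len(img[0]) // patch
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = sum(
                img[r * patch + i][c * patch + j]
                for i in range(patch)
                for j in range(patch)
            )
            row.append(total / (patch * patch))
        out.append(row)
    return out


def abs_diff(a: Frame, b: Frame) -> Frame:
    """Pixel-wise absolute difference of two frames."""
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sq_diff(a: Frame, b: Frame) -> Frame:
    """Pixel-wise squared difference of two frames."""
    return [[(x - y) ** 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def motion_token_weights(prev: Frame, curr: Frame, patch: int,
                         eps: float = 1e-8) -> List[List[float]]:
    """Per-token weights proportional to motion magnitude.

    Assumption: the absolute temporal difference approximates the motion
    gradient. Static background tokens receive weights close to zero.
    """
    motion = patch_means(abs_diff(curr, prev), patch)
    total = sum(map(sum, motion)) + eps  # eps avoids division by zero
    return [[m / total for m in row] for row in motion]


def weighted_reconstruction_loss(target: Frame, recon: Frame,
                                 weights: List[List[float]], patch: int) -> float:
    """Per-token mean squared error, weighted by the motion-based token weights."""
    per_token = patch_means(sq_diff(target, recon), patch)
    return sum(
        w * e
        for w_row, e_row in zip(weights, per_token)
        for w, e in zip(w_row, e_row)
    )


def decoder_discrepancy(teacher_out: Frame, student_out: Frame) -> float:
    """Mean squared difference between the two decoder outputs (assumed score)."""
    sq = sq_diff(teacher_out, student_out)
    return sum(map(sum, sq)) / (len(sq) * len(sq[0]))


# Toy example: a 4x4 frame in which only the top-left 2x2 patch changes.
prev_frame = [[0.0] * 4 for _ in range(4)]
curr_frame = [[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
weights = motion_token_weights(prev_frame, curr_frame, patch=2)

# A reconstruction that is wrong only in the static bottom-right corner.
background_error = [row[:] for row in curr_frame]
background_error[3][3] = 0.5
loss = weighted_reconstruction_loss(curr_frame, background_error, weights, patch=2)
```

In this toy case nearly all of the weight falls on the moving top-left token, so an error confined to the static background contributes nothing to the loss. This mirrors the stated goal of not learning to reconstruct the static scene, under the assumptions listed above.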