Detection of anomalous events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance of 89.78% AUC. Moreover, it is complementary to SotA methods, achieving 95.32% AUC on the ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended version of UCF-Crime, named the Video Anomaly Detection Dataset (VADD), comprising 2,591 videos in 18 classes with extensive coverage of realistic anomalies, for development and evaluation on a wider range of anomalies.
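The multi-timescale idea can be sketched as follows. This is a minimal NumPy stand-in, not the paper's pipeline: the tubelet lengths (4, 8, 16 frames) are illustrative placeholders, and a simple mean-pool encoder substitutes for the Video Swin Transformer backbone.

```python
import numpy as np

def sample_tubelets(video, tubelet_len, stride):
    """Split a video array (T, H, W, C) into temporal tubelets of tubelet_len frames."""
    T = video.shape[0]
    return [video[t:t + tubelet_len]
            for t in range(0, T - tubelet_len + 1, stride)]

def encode(tubelet):
    # Stand-in for the Video Swin Transformer feature extractor:
    # average-pool over time and space to a fixed-size (C,) feature vector.
    return tubelet.mean(axis=(0, 1, 2))

def multi_timescale_features(video, scales=(4, 8, 16)):
    """Extract per-scale tubelet features and concatenate them (hypothetical scales)."""
    feats = []
    for length in scales:
        tubelet_feats = [encode(t) for t in sample_tubelets(video, length, length)]
        feats.append(np.stack(tubelet_feats).mean(axis=0))  # pool over tubelets
    return np.concatenate(feats)  # one vector combining short/medium/long timescales

video = np.random.rand(32, 8, 8, 3)  # toy clip: 32 frames, 8x8 pixels, RGB
features = multi_timescale_features(video)
print(features.shape)  # (9,) = 3 scales x 3 channels
```

In the actual method, each scale's tubelets would pass through the shared Video Swin backbone and feed the anomaly scoring head; the sketch only shows how the same clip yields one fused descriptor from several temporal granularities.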