Video Anomaly Detection (VAD) is an essential yet challenging task in signal processing. Because some anomalies cannot be detected by analyzing temporal or spatial information in isolation, the interaction between the two is considered crucial for VAD. However, current dual-stream architectures either confine this interaction to the bottleneck of the autoencoder or introduce anomaly-irrelevant background pixels into the interaction, which hinders detection accuracy. To address these deficiencies, we propose a Multi-scale Spatial-Temporal Interaction Network (MSTI-Net) for VAD. First, to prioritize the detection of moving objects in the scene and reconcile the substantial semantic discrepancy between the two types of information, we propose an Attention-based Spatial-Temporal Fusion Module (ASTFM) as a substitute for conventional direct fusion. Second, we insert multiple ASTFM-based connections that bridge the appearance and motion streams of the dual-stream network, thereby enabling spatial-temporal interaction at multiple scales. Finally, to sharpen the distinction between normal and abnormal activities, our system records regular (normal) patterns in a memory module. Experimental results on three benchmark datasets validate the effectiveness of our approach, which achieves AUCs of 96.8%, 87.6%, and 73.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
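To make the two main ideas concrete, the following is a minimal sketch, not the MSTI-Net implementation. The abstract does not specify the internals of ASTFM or the memory module, so everything here is an assumption for illustration: `motion_attention` and `fuse` gate single-channel appearance features with a normalized motion-magnitude map so that static background contributes nothing, and `memory_read` uses cosine-similarity soft addressing, a common choice in memory-augmented autoencoders. All function names are hypothetical.

```python
import math


def motion_attention(motion):
    """Scale motion magnitudes to [0, 1]; static background gets weight 0.

    Assumption: a simple peak normalization stands in for learned attention.
    """
    peak = max(abs(v) for row in motion for v in row)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in motion]
    return [[abs(v) / peak for v in row] for row in motion]


def fuse(appearance, motion):
    """Weight appearance features element-wise by the motion attention map."""
    att = motion_attention(motion)
    return [[a * w for a, w in zip(a_row, w_row)]
            for a_row, w_row in zip(appearance, att)]


def memory_read(query, memory):
    """Soft-address memory items by cosine similarity.

    Returns the weighted sum of memory items and the addressing weights.
    Assumption: memory items are prototypes of normal patterns.
    """
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    sims = [cosine(query, item) for item in memory]
    top = max(sims)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    read = [sum(w * item[i] for w, item in zip(weights, memory))
            for i in range(len(query))]
    return read, weights


if __name__ == "__main__":
    appearance = [[1.0, 2.0], [3.0, 4.0]]
    motion = [[0.0, 0.5], [0.0, -1.0]]  # left column is static background
    print(fuse(appearance, motion))

    read, weights = memory_read([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(read, weights)
```

In this sketch, a query that lies far from every stored item is reconstructed poorly from the memory read, and that gap is what a memory-based detector would typically use as an anomaly cue. Applying `fuse` at several feature resolutions would correspond loosely to the multi-scale connections described above.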