Learning discriminative features that effectively separate abnormal events from normal ones is crucial for weakly supervised video anomaly detection (WS-VAD). Existing approaches, whether oriented toward video-level or segment-level labels, mainly focus on extracting representations of anomalous data while neglecting the role of normal data. We observe that such a scheme is sub-optimal and may yield a higher false alarm rate, since distinguishing anomalies well requires understanding what the normal state is. To address this issue, we propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model that learns both representations of normal data and discriminative features of abnormal data. Specifically, inspired by the traditional global and local structure of graph convolutional networks, we introduce a Global and Local Multi-Head Self Attention (GL-MHSA) module for the Transformer network to obtain more expressive embeddings that capture associations in videos. Then, we use two memory banks, with the additional abnormal memory included to tackle hard samples, to store and separate abnormal and normal prototypes and to maximize the margin between the two representations. Finally, we propose an uncertainty learning scheme to learn a latent space for normal data that is robust to noise from camera switching, object changes, scene changes, and so on. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
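To make the dual-memory idea more concrete, the following is a minimal sketch of how segment features might query two prototype banks. It is not the authors' implementation: the function name `memory_read`, the sigmoid-scaled dot-product similarity, the top-k scoring, and all sizes are assumptions chosen for illustration, and the memories here are random rather than learned.

```python
import numpy as np

def memory_read(features, memory, top_k=4):
    """Query a memory bank with segment features (hypothetical helper).

    features: (T, D) array, one row per video segment.
    memory:   (M, D) array, one row per stored prototype.
    Returns (augmented, score): (T, D) memory-augmented features and
    a (T,) array holding the mean of the top_k similarities per segment.
    """
    d = features.shape[1]
    # Assumed similarity: scaled dot product squashed into (0, 1) by a sigmoid.
    sim = 1.0 / (1.0 + np.exp(-(features @ memory.T) / np.sqrt(d)))
    # Read-out: similarity-weighted sum of the stored prototypes.
    augmented = sim @ memory
    # Per-segment score: average of the k largest similarities.
    top = np.sort(sim, axis=1)[:, -top_k:]
    return augmented, top.mean(axis=1)

# Illustrative sizes only: 8 segments, 16-dim features, 6 prototypes per bank.
rng = np.random.default_rng(0)
T, D, M = 8, 16, 6
segments = rng.normal(size=(T, D))
normal_mem = rng.normal(size=(M, D))    # stands in for learned normal prototypes
abnormal_mem = rng.normal(size=(M, D))  # stands in for learned abnormal prototypes

aug_n, score_n = memory_read(segments, normal_mem)
aug_a, score_a = memory_read(segments, abnormal_mem)
```

In a trained model, one would expect a normal segment to score high against the normal bank and low against the abnormal bank, which is one way the margin between the two kinds of prototype could be enforced during training.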