The goal of weakly supervised video anomaly detection is to learn a detection model using only video-level labels. However, prior studies typically divide videos into fixed-length segments without considering the complexity or duration of anomalies. Moreover, these studies usually detect only the most abnormal segments and may therefore overlook the completeness of an anomaly. To address these limitations, we propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection, which learns multi-scale temporal features. Specifically, to handle abnormal events of varying duration, we first propose a multi-scale temporal modeling module that extracts features from segments of different lengths and captures both local and global visual information across temporal scales. Then, we design a dynamic erasing strategy that assesses the completeness of the detected anomalies on the fly and erases prominent abnormal segments, encouraging the model to discover subtle abnormal segments in a video. The proposed method achieves favorable performance against several state-of-the-art approaches on three datasets: XD-Violence, TAD, and UCF-Crime. Code will be made available at https://github.com/ArielZc/DE-Net.
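To make the two ideas more concrete, here is a minimal NumPy sketch written under our own assumptions; it is not DE-Net's actual implementation. It assumes per-snippet features of shape (T, D) from some pretrained backbone and per-snippet anomaly scores in [0, 1]. The window-averaging scheme, the confidence-based completeness rule, and every name below (`multi_scale_features`, `dynamic_erase`, `video_score`, `scales`, `k`, `confidence`) are hypothetical illustrations.

```python
import numpy as np


def multi_scale_features(features, scales=(1, 2, 4)):
    """Hypothetical multi-scale temporal pooling.

    features: (T, D) array of per-snippet features.
    For each scale s, every snippet is replaced by the mean of a window of
    (at most) s neighbouring snippets, and all scales are concatenated,
    giving a (T, D * len(scales)) array. Scale 1 keeps the original features.
    """
    T, _ = features.shape
    outputs = []
    for s in scales:
        pooled = np.empty_like(features, dtype=float)
        half = s // 2
        for t in range(T):
            lo = max(0, t - half)   # window start, clipped at the video start
            hi = min(T, lo + s)     # window end, clipped at the video end
            pooled[t] = features[lo:hi].mean(axis=0)
        outputs.append(pooled)
    return np.concatenate(outputs, axis=1)


def dynamic_erase(scores, k=2, confidence=0.8):
    """Hypothetical erasing step returning a boolean keep-mask over snippets.

    Assumed completeness rule: if the k most abnormal snippets already score
    at least `confidence` on average, treat them as found and erase (mask)
    them, so the next top-k selection must rely on less obvious snippets.
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(scores.shape[0], dtype=bool)
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    if scores[top].mean() >= confidence:
        keep[top] = False
    return keep


def video_score(scores, keep, k=2):
    """Top-k mean over the snippets that were not erased (MIL-style)."""
    kept = np.asarray(scores, dtype=float)[keep]
    return float(np.sort(kept)[::-1][:k].mean())


# Usage: two prominent snippets (indices 1 and 2) are erased, so the
# video-level score is driven by the remaining, gentler snippets.
snippet_scores = [0.1, 0.95, 0.9, 0.4, 0.3, 0.05]
mask = dynamic_erase(snippet_scores)
print(mask.tolist())
print(video_score(snippet_scores, mask))
```

Averaging over windows is only one simple way to look at several temporal extents at once; a learned module (for example, dilated temporal convolutions or attention) would be the more typical choice in practice.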