Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. Because the definition of an anomaly is ambiguous and varies across contexts, this approach introduces bias when separating abnormal from normal snippets within an abnormal bag. As a first step toward showing the model why a snippet is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them can be computed to identify the suspected anomalous events for each video snippet. This enables a new multi-prompt learning process that constrains the visual-semantic features across all videos, and it provides a new way to label pseudo anomalies for self-training. To demonstrate effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets: XD-Violence, UCF-Crime, TAD, and ShanghaiTech. The proposed model outperforms most state-of-the-art methods in terms of AP or AUC (82.6%, 87.7%, 93.1%, and 97.4%, respectively). Furthermore, it shows promising performance in open-set and cross-dataset cases.
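The matching step between snippet captions and the prompt dictionary can be illustrated with a minimal sketch. The abstract does not specify the text encoder or the similarity measure, so everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real text encoder, cosine similarity stands in for the semantic anomaly similarity, and the names `suspected_anomalies`, `prompts`, `captions`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suspected_anomalies(captions, prompts, threshold=0.0):
    """For each snippet caption, return (best-matching prompt or None, score).

    A caption whose best score does not exceed `threshold` is treated as
    having no suspected anomalous event (hypothetical rule).
    """
    prompt_vecs = [embed(p) for p in prompts]
    results = []
    for caption in captions:
        cap_vec = embed(caption)
        scores = [cosine(cap_vec, pv) for pv in prompt_vecs]
        best = max(range(len(prompts)), key=scores.__getitem__)
        event = prompts[best] if scores[best] > threshold else None
        results.append((event, scores[best]))
    return results


# Hypothetical prompt dictionary and snippet captions.
prompts = ["fighting", "explosion", "car accident"]
captions = [
    "two men fighting in the street",
    "a person walking a dog",
    "a car accident on the highway",
]

results = suspected_anomalies(captions, prompts)
for caption, (event, score) in zip(captions, results):
    print(f"{caption!r}: suspected event = {event}, similarity = {score:.3f}")
```

In this sketch, per-snippet scores of this kind are the sort of signal that could feed pseudo-anomaly labeling for self-training; how the proposed framework actually uses them is not detailed in the abstract.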