Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the accuracy of the generated pseudo-labels and degrades self-training performance. Inspired by the manual labeling process based on event descriptions, in this paper we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole
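
To make the core idea of matching event description text against video frames more concrete, the following is a minimal sketch and not the TPWNG implementation. It assumes that frame and text embeddings already live in a shared space (as CLIP would provide) and replaces them with random stand-in vectors. The function name `frame_pseudo_labels`, the temperature value, and the 0.5 threshold are all hypothetical choices for illustration. Each frame is scored against a normal-event text embedding and an anomaly-event text embedding, and the softmax over the two similarities is read as a frame-level anomaly score.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def frame_pseudo_labels(frame_feats, anomaly_text, normal_text, temperature=0.07):
    """Return one anomaly score in [0, 1] per frame.

    Hypothetical helper: the score is the softmax probability of the anomaly
    text when each frame is compared with the normal and anomaly text
    embeddings. The temperature is an assumed value, not taken from the paper.
    """
    frames = l2_normalize(frame_feats)                         # (T, D)
    texts = l2_normalize(np.stack([normal_text, anomaly_text]))  # (2, D)
    logits = frames @ texts.T / temperature                    # (T, 2)
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]


# Stand-in embeddings: random vectors instead of real CLIP features (assumption).
rng = np.random.default_rng(0)
dim = 16
normal_text = rng.normal(size=dim)
anomaly_text = rng.normal(size=dim)

# Four frames close to the normal description, then three close to the anomaly one.
frames = np.vstack([
    normal_text + 0.1 * rng.normal(size=(4, dim)),
    anomaly_text + 0.1 * rng.normal(size=(3, dim)),
])

scores = frame_pseudo_labels(frames, anomaly_text, normal_text)
pseudo = (scores > 0.5).astype(int)  # assumed threshold for binary pseudo-labels
print(pseudo.tolist())
```

In this toy setting the normal text embedding acts as a reference against which anomaly similarity is judged, which is only a rough analogue of the normality guidance described above; the fine-tuning losses, learnable prompts, and temporal module are not modeled here.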
