Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Video anomaly detection under weak supervision is challenging because frame-level annotations are unavailable during training. Previous work has employed graph convolutional networks or self-attention mechanisms to model temporal relations, together with a multiple instance learning (MIL)-based classification loss to learn discriminative features. However, most of these methods use multiple branches to capture local and global dependencies separately, which increases the parameter count and computational cost. Furthermore, the binary constraint of the MIL-based loss only ensures coarse-grained inter-class separability and ignores fine-grained discriminability within anomalous classes. In this paper, we propose a weakly supervised anomaly detection framework that emphasizes efficient context modeling and enhanced semantic discriminability. To this end, we first construct a temporal context aggregation (TCA) module that captures complete contextual information by reusing the similarity matrix and applying adaptive fusion. We then propose a prompt-enhanced learning (PEL) module that incorporates semantic priors into the model through knowledge-based prompts, with the aim of enhancing the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Finally, we introduce a score smoothing (SS) module in the testing phase to suppress individual bias and reduce false alarms. Extensive experiments demonstrate the effectiveness of each component of our method, which achieves competitive performance with fewer parameters and lower computational cost on three challenging benchmarks: the UCF-crime, XD-violence, and ShanghaiTech datasets. The detection accuracy of some anomaly sub-classes also improves by a large margin.
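
To make the role of the score smoothing (SS) step concrete, the following is a minimal sketch of one way test-time smoothing could work. The abstract does not specify the smoothing operator, so the centered moving average, the function name `smooth_scores`, the `window` parameter, and the toy scores below are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def smooth_scores(scores: Sequence[float], window: int = 5) -> List[float]:
    """Smooth per-snippet anomaly scores with a centered moving average.

    Assumption: the abstract only says that scores are smoothed at test
    time; a moving average is used here as one plausible choice.
    `window` is expected to be odd; for an even value the effective
    width is window + 1. Windows are truncated at the sequence edges.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        neighbourhood = scores[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# Hypothetical scores: one isolated spike (index 2) and one sustained run.
raw = [0.0, 0.0, 0.9, 0.0, 0.0, 0.6, 0.6, 0.6]
smoothed = smooth_scores(raw, window=3)
```

With these toy values, the isolated spike at index 2 is damped to roughly 0.3, while the end of the sustained run keeps its score of 0.6. This is the sense in which smoothing can suppress the bias of individual snippets and reduce false alarms; a larger window damps spikes more strongly but also blurs the boundaries of genuine anomalous segments.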
