Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Video anomaly detection under weak supervision is challenging, chiefly because frame-level annotations are unavailable during training. Prior work has used graph convolutional networks and self-attention mechanisms, together with a multiple instance learning (MIL)-based classification loss, to model temporal relations and learn discriminative features. However, these methods often employ multi-branch architectures to capture local and global dependencies separately, which increases parameter counts and computational cost. Moreover, the binary constraint of the MIL-based loss provides only coarse-grained inter-class separability and neglects fine-grained discriminability within anomalous classes. In response, this paper introduces a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability. We present a Temporal Context Aggregation (TCA) module that captures comprehensive contextual information by reusing the similarity matrix and applying adaptive fusion. Additionally, we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic priors through knowledge-based prompts to boost the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Extensive experiments validate the effectiveness of each component and demonstrate competitive performance with fewer parameters and lower computational cost on three challenging benchmarks: UCF-Crime, XD-Violence, and ShanghaiTech. Notably, our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.
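
The abstract does not detail how TCA reuses the similarity matrix, so the following is only a minimal sketch of one plausible reading: a single snippet-to-snippet similarity matrix is computed once, normalized over all snippets for global context and over a temporal window for local context, and the two results are blended. The function name `temporal_context`, the `window` parameter, and the fixed scalar `alpha` (standing in for whatever learned fusion the paper uses) are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(attn, features):
    """Weighted sum of feature rows: returns attn @ features."""
    dim = len(features[0])
    return [
        [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        for weights in attn
    ]


def temporal_context(features, window=1, alpha=0.5):
    """Hypothetical single-branch context aggregation.

    features: list of T snippet feature vectors (each of length D).
    window:   snippets farther apart than this are masked in the local view.
    alpha:    fixed blend weight (assumption; a real model would learn it).
    """
    t, d = len(features), len(features[0])
    # The similarity matrix is computed once and reused by both views.
    sim = [
        [sum(x * y for x, y in zip(features[i], features[j])) / math.sqrt(d)
         for j in range(t)]
        for i in range(t)
    ]
    # Global view: normalize each row over all snippets.
    global_attn = [softmax(row) for row in sim]
    # Local view: same scores, masked outside the temporal window.
    neg_inf = float("-inf")
    local_attn = [
        softmax([sim[i][j] if abs(i - j) <= window else neg_inf
                 for j in range(t)])
        for i in range(t)
    ]
    g = apply_attention(global_attn, features)
    l = apply_attention(local_attn, features)
    # Blend the global and local context features.
    fused = [
        [alpha * gv + (1.0 - alpha) * lv for gv, lv in zip(g_row, l_row)]
        for g_row, l_row in zip(g, l)
    ]
    return fused, global_attn, local_attn
```

Because both views share `sim`, the pairwise scores are paid for once rather than once per branch, which is the kind of saving the abstract attributes to avoiding multi-branch designs.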
