We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, using only video-level supervision. We introduce AnomalyCLIP, a novel method and the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach manipulates the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to the corresponding class. We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, which produces the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods on three major anomaly detection benchmarks (ShanghaiTech, UCF-Crime, and XD-Violence) and empirically show that it outperforms the baselines in recognising video anomalies.
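The projection idea in the abstract can be illustrated with a minimal sketch. It is not the AnomalyCLIP implementation: the abstract does not say how the normal event subspace is identified or how the directions are learned, so the sketch assumes that the mean of normal-frame features serves as the origin and that each class direction is the unit vector from that origin to a fixed class text embedding. The helper names (`class_directions`, `project`) and the 3-D toy vectors standing in for CLIP features are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def normalise(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_directions(text_embeddings, normal_centre):
    """Unit directions from the normal centre towards each class text embedding.

    Assumption: directions are fixed, not learned as in the paper.
    """
    return {name: normalise(subtract(emb, normal_centre))
            for name, emb in text_embeddings.items()}

def project(frame_feature, normal_centre, directions):
    """Signed projection of a re-centred frame feature onto each class direction."""
    centred = subtract(frame_feature, normal_centre)
    return {name: sum(c * d for c, d in zip(centred, direction))
            for name, direction in directions.items()}

# Toy 3-D vectors standing in for CLIP features (made-up numbers).
normal_frames = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.0], [1.0, -0.2, 0.0]]
text_embeddings = {"fighting": [1.0, 1.0, 0.0], "explosion": [1.0, 0.0, 1.0]}

centre = mean_vector(normal_frames)          # assumed stand-in for the normal subspace
directions = class_directions(text_embeddings, centre)

normal_scores = project([1.0, 0.1, 0.0], centre, directions)
fight_scores = project([1.0, 0.9, 0.1], centre, directions)

# The class whose direction yields the largest projection is the prediction.
predicted = max(fight_scores, key=fight_scores.get)
```

In this toy setting the frame close to the normal centre has small projections on every direction, while the second frame has a large projection on the "fighting" direction only, which mirrors the behaviour the abstract describes for anomalous frames of a particular class.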