The collection and detection of video anomaly data have long been challenging due to the rarity and spatio-temporal scarcity of anomalous events. Existing video anomaly detection (VAD) methods underperform in open-world scenarios, largely because of limited dataset diversity and an inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework; ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies, improving adaptability to unseen anomaly categories, and further integrates a Multimodal Large Language Model (MLLM) to strengthen semantic comprehension; and iii) we design a reverse-attention-based token compression approach that handles the spatio-temporal scarcity of anomalous patterns while reducing computational cost. Training is conducted solely on pseudo-anomalies, without any VAD data. Evaluations on four benchmark VAD datasets demonstrate that LAVIDA achieves state-of-the-art performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.
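To illustrate the intuition behind reverse-attention token compression, the following is a minimal, hypothetical sketch (not the paper's actual implementation): assuming a model attuned to normal content assigns low attention to anomalous regions, inverting the attention weights and keeping only the top-scoring tokens retains the sparse, likely-anomalous ones. The function name, tensor shapes, and `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def reverse_attention_compress(tokens, attn, keep_ratio=0.25):
    # Hypothetical sketch: keep the tokens that receive the LEAST
    # attention from a normal-context query, assuming anomalous
    # regions are under-attended.
    # tokens: (N, D) token features; attn: (N,) weights summing to 1.
    reversed_w = 1.0 - attn                   # invert the attention profile
    k = max(1, int(len(tokens) * keep_ratio)) # number of tokens to keep
    idx = np.argsort(reversed_w)[::-1][:k]    # top-k reversed weights
    return tokens[idx], np.sort(idx)

# Toy usage: 8 tokens with 4-dim features; tokens 4-7 are under-attended.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
attn = np.array([0.30, 0.25, 0.20, 0.10, 0.05, 0.04, 0.03, 0.03])
kept, idx = reverse_attention_compress(tokens, attn, keep_ratio=0.5)
print(kept.shape, idx)  # (4, 4) and the indices of the least-attended tokens
```

Dropping well-attended (normal) tokens before the MLLM stage is one plausible way such a scheme reduces compute while preserving the sparse anomalous evidence.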