This paper addresses unsupervised video anomaly detection (VAD), which involves classifying each frame in a video as normal or abnormal without access to any labels. To this end, the proposed method employs conditional diffusion models, where the input data are spatiotemporal features extracted from a pre-trained network and the condition is features extracted from compact motion representations that summarize a given video segment in terms of its motion and appearance. Our method uses a data-driven threshold and treats a high reconstruction error as an indicator of anomalous events. This study is the first to use compact motion representations for VAD, and experiments on two large-scale VAD benchmarks demonstrate that they supply relevant information to the diffusion model and consequently improve VAD performance with respect to the prior art. Importantly, our method exhibits better generalization performance across different datasets, notably outperforming both the state-of-the-art and baseline methods. The code of our method is available at https://github.com/AnilOsmanTur/conditioned_video_anomaly_diffusion
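The decision rule described above (a high reconstruction error, compared against a data-driven threshold, signals an anomaly) can be illustrated with a minimal sketch. The abstract does not say how the threshold is computed, so the mean-plus-k-standard-deviations rule below is an assumption made purely for illustration. The function names are hypothetical, and the diffusion model that would produce the reconstructions is not modeled here.

```python
from statistics import mean, pstdev


def reconstruction_errors(features, reconstructions):
    """Mean squared error between each feature vector and its reconstruction."""
    return [
        mean([(x - y) ** 2 for x, y in zip(feat, recon)])
        for feat, recon in zip(features, reconstructions)
    ]


def data_driven_threshold(errors, k=1.0):
    # ASSUMPTION: threshold = mean + k * population std of the errors.
    # The abstract only says the threshold is data-driven; this rule is
    # an illustrative stand-in, not the authors' method.
    return mean(errors) + k * pstdev(errors)


def flag_anomalies(errors, k=1.0):
    """Return one boolean per segment: True where the error exceeds the threshold."""
    threshold = data_driven_threshold(errors, k)
    return [e > threshold for e in errors]


# Toy example: five segments, the last one is reconstructed poorly.
features = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
reconstructions = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)] + [[2.0, 2.0, 2.0, 2.0]]

errors = reconstruction_errors(features, reconstructions)
flags = flag_anomalies(errors, k=1.0)
```

In this toy example only the last segment has a non-zero reconstruction error, so it is the only one flagged. In practice the threshold would be computed from the statistics of the dataset at hand, and the choice of `k` here is arbitrary.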