Anomalies are rare, so anomaly detection is often framed as One-Class Classification (OCC), i.e. training solely on normal data. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and flag anything outside them as abnormal, which accounts satisfactorily for the open-set nature of anomalies. But normality is equally open-set, since humans can perform the same action in several ways, and the leading techniques neglect this. We propose a novel generative model for video anomaly detection (VAD) which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on people's past motion, and exploit the improved mode coverage of diffusion processes to generate different-but-plausible future motions. After statistically aggregating the generated future modes, an anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on four established benchmarks (UBnormal, HR-UBnormal, HR-STC, and HR-Avenue), where extensive experiments show that it surpasses state-of-the-art results.
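The scoring step described above (generate several plausible futures, aggregate them, and compare against the actual future) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method's actual implementation: the function name `anomaly_score`, the array layout `(K, T, J, C)` for K generated futures of T frames and J joints, the per-joint Euclidean error, and the `min`/`mean`/`median` aggregation choices are all hypothetical, and the diffusion model is replaced by noisy copies of a reference motion.

```python
import numpy as np


def anomaly_score(generated, actual, aggregate="min"):
    """Score how poorly a set of generated futures explains the actual future.

    generated: array of shape (K, T, J, C), K sampled future motions (assumed layout).
    actual:    array of shape (T, J, C), the observed future motion.
    aggregate: statistic over the K per-sample errors (assumed choices).
    """
    generated = np.asarray(generated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Per-joint Euclidean distance, averaged over frames and joints -> shape (K,)
    errors = np.linalg.norm(generated - actual[None], axis=-1).mean(axis=(1, 2))
    reducers = {"min": np.min, "mean": np.mean, "median": np.median}
    return float(reducers[aggregate](errors))


rng = np.random.default_rng(0)
# Hypothetical "normal" future: 6 frames, 17 joints, 2D coordinates.
actual_normal = rng.normal(size=(6, 17, 2))
# Stand-in for diffusion samples: 5 futures scattered around the normal motion.
generated = actual_normal[None] + 0.05 * rng.normal(size=(5, 6, 17, 2))
# Hypothetical "abnormal" future: far from every generated sample.
actual_abnormal = actual_normal + 3.0

normal_score = anomaly_score(generated, actual_normal)
abnormal_score = anomaly_score(generated, actual_abnormal)
print(normal_score < abnormal_score)
```

With the `min` statistic, a future counts as normal if at least one generated mode explains it, which is one way to respect the multimodality of normal motion; `mean` or `median` would instead penalize futures that match only a minority of modes.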