Anomalies are rare, so anomaly detection is often framed as One-Class Classification (OCC), i.e., a model is trained solely on normal data. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and flag anything outside them as abnormal, which accounts satisfactorily for the open-set nature of anomalies. But normality is open-set too, since humans can perform the same action in several ways, and the leading techniques neglect this. We propose a novel generative model for video anomaly detection (VAD), which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on the past motion of people and exploit the improved mode coverage of diffusion processes to generate different-but-plausible future motions. After statistically aggregating the future modes, an anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on four established benchmarks (UBnormal, HR-UBnormal, HR-STC, and HR-Avenue), where extensive experiments show that it surpasses state-of-the-art results.
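The detection step described in the abstract (generate several plausible futures, aggregate them statistically, and compare against the actual future) can be illustrated with a minimal sketch. The abstract does not specify the error measure or the aggregation statistic, so the mean squared error, the choice of `min` as the default aggregate, and the names `motion_error` and `anomaly_score` are all assumptions made for illustration. The diffusion model itself is not implemented here: the generated futures are simply passed in as lists of frames.

```python
import statistics
from typing import Callable, List, Sequence

Pose = Sequence[float]    # flattened joint coordinates for one frame
Motion = Sequence[Pose]   # a sequence of frames


def motion_error(generated: Motion, actual: Motion) -> float:
    """Mean squared error between two motions of identical shape.

    Assumption: MSE stands in for whatever distance the model really uses.
    """
    if len(generated) != len(actual):
        raise ValueError("motions must have the same number of frames")
    total, count = 0.0, 0
    for g_frame, a_frame in zip(generated, actual):
        if len(g_frame) != len(a_frame):
            raise ValueError("frames must have the same number of coordinates")
        for g, a in zip(g_frame, a_frame):
            total += (g - a) ** 2
            count += 1
    if count == 0:
        raise ValueError("motions must not be empty")
    return total / count


def anomaly_score(
    generated_futures: List[Motion],
    actual_future: Motion,
    aggregate: Callable[[List[float]], float] = min,
) -> float:
    """Aggregate the errors of all generated futures into one score.

    Assumption: with `min`, the score is low whenever at least one
    generated mode matches the actual future, so several different
    ways of performing a normal action are all tolerated.
    """
    if not generated_futures:
        raise ValueError("at least one generated future is required")
    errors = [motion_error(g, actual_future) for g in generated_futures]
    return aggregate(errors)


# Two hypothetical generated futures (2 frames, 2 coordinates each).
futures = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
normal_future = [[0.0, 0.0], [1.0, 1.0]]
unusual_future = [[5.0, 5.0], [5.0, 5.0]]

normal_score = anomaly_score(futures, normal_future)
unusual_score = anomaly_score(futures, unusual_future)
mean_score = anomaly_score(futures, normal_future, aggregate=statistics.mean)
```

Here the future that coincides with one of the generated modes receives a lower score than the one far from all of them. Passing `statistics.mean` or `statistics.median` as `aggregate` gives alternative aggregations. Which statistic works best in practice is an empirical question that this sketch does not answer.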