Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge arises from the scarcity of training samples covering all anomaly cases. Hence, semi-supervised anomaly detection methods have attracted more attention: they model only normal data and detect anomalies by measuring deviations from the learned normal patterns. Despite the impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not yet been effectively explored. Inspired by the capabilities of the future frame prediction proxy task, we introduce future video prediction from a single frame as a novel proxy task for video anomaly detection. This proxy task alleviates the difficulty previous methods have in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation maps, which not only makes the method aware of object classes but also makes the prediction task less complex for the model. Extensive experiments on benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and its superior performance compared to state-of-the-art (SOTA) prediction-based VAD methods.