Video anomaly detection is an ill-posed problem because it depends on many factors, such as appearance, pose, camera angle, and background. We distill the problem to anomaly detection of human pose, which reduces the risk that nuisance parameters such as appearance affect the result. Focusing on pose alone has the side benefit of reducing bias against distinct minority groups. Our model works directly on human pose graph sequences and is exceptionally lightweight (~1K parameters), so it can run on any machine able to run the pose estimation, with negligible additional resources. We leverage the highly compact pose representation in a normalizing flows framework, which we extend to handle the unique characteristics of spatio-temporal pose data, and we show its advantages in this use case. The algorithm is quite general: it can handle training data consisting only of normal examples as well as a supervised setting with labeled normal and abnormal examples. We report state-of-the-art results on two anomaly detection benchmarks: the unsupervised ShanghaiTech dataset and the recent supervised UBnormal dataset.
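To make the likelihood-based scoring idea concrete, the following is a minimal sketch and not the model described above. It assumes the simplest possible normalizing flow, a single per-dimension affine map onto a standard normal base, fitted on normal-only data, and it scores a flattened pose sequence by its negative log-likelihood. The names `fit_affine_flow`, `log_likelihood`, and `anomaly_score`, the sequence shape (8 frames, 17 keypoints, 2 coordinates), and the synthetic Gaussian "poses" are all hypothetical choices made for illustration.

```python
import math
import random


def fit_affine_flow(train):
    """Fit z = (x - mu) / sigma per dimension on normal-only training data."""
    n, dim = len(train), len(train[0])
    mu = [sum(row[d] for row in train) / n for d in range(dim)]
    sigma = [
        math.sqrt(sum((row[d] - mu[d]) ** 2 for row in train) / n) + 1e-6
        for d in range(dim)
    ]
    return mu, sigma


def log_likelihood(x, mu, sigma):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    log_base = 0.0
    log_det = 0.0
    for xd, m, s in zip(x, mu, sigma):
        z = (xd - m) / s
        log_base += -0.5 * (z * z + math.log(2 * math.pi))
        log_det += -math.log(s)  # dz/dx = 1 / sigma in each dimension
    return log_base + log_det


def anomaly_score(x, mu, sigma):
    """Higher score means less likely under the fitted flow."""
    return -log_likelihood(x, mu, sigma)


# Hypothetical flattened pose sequences: frames x keypoints x (x, y).
FRAMES, KEYPOINTS = 8, 17
DIM = FRAMES * KEYPOINTS * 2
random.seed(0)


def sample(scale, count):
    """Synthetic stand-in for pose data, drawn from a zero-mean Gaussian."""
    return [[random.gauss(0.0, scale) for _ in range(DIM)] for _ in range(count)]


normal_train = sample(1.0, 200)
normal_test = sample(1.0, 20)
abnormal_test = sample(3.0, 20)  # wider spread stands in for unusual motion

mu, sigma = fit_affine_flow(normal_train)
normal_scores = [anomaly_score(x, mu, sigma) for x in normal_test]
abnormal_scores = [anomaly_score(x, mu, sigma) for x in abnormal_test]
```

In this toy setup the abnormal samples receive higher scores than the normal ones because they fall in low-density regions of the fitted distribution. A practical flow would stack many invertible layers and, as the abstract notes, would need extensions to respect the graph and temporal structure of pose data.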