Configurable Spatial-Temporal Hierarchical Analysis for Flexible Video Anomaly Detection

Video anomaly detection (VAD) is a vital task with important practical applications in industrial surveillance, security systems, and traffic control. Unlike previous unsupervised VAD methods, which adopt a fixed structure to learn normality without considering different detection demands, we propose a configurable spatial-temporal hierarchical architecture (STHA) that flexibly detects anomalies of different degrees. STHA is organized into three levels: the stream level, the stack level, and the block level. Specifically, we design several auto-encoder-based blocks with varying capacities for extracting normal patterns. We then stack these blocks according to their complexity, using both intra-stack and inter-stack residual links, to learn hierarchical normality gradually. To exploit the multisource knowledge in videos, we also model the spatial normality of video frames and the temporal normality of RGB differences with two parallel streams, each composed of stacks. STHA can thus provide a range of representation learning abilities by expanding or contracting hierarchically to detect anomalies of different degrees. Because the set of possible anomalies is complex and unbounded, STHA can adjust its detection ability to match human detection demands and the complexity of the anomalies that have previously occurred in a scene. We conduct experiments on three benchmarks and perform extensive analysis; the results demonstrate that our method performs comparably to state-of-the-art methods. In addition, we design a toy dataset to show that our model can better balance its learning ability to adapt to different detection demands.

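The three-level hierarchy described in the abstract (blocks inside stacks inside streams) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names `Block`, `Stack`, and `Stream`, the linear layers with random untrained weights, the exact form of the residual links (each unit models whatever the previous units left unexplained), and the equal-weight sum of the two stream errors are all assumptions made for illustration. The abstract specifies only that blocks are auto-encoder based, that stacks use intra-stack and inter-stack residual links, and that two parallel streams take video frames and RGB differences as input.

```python
import numpy as np


class Block:
    """Toy auto-encoder block (assumed form): linear encode, tanh, linear decode."""

    def __init__(self, dim, bottleneck, rng):
        self.enc = rng.normal(0.0, 0.1, (dim, bottleneck))
        self.dec = rng.normal(0.0, 0.1, (bottleneck, dim))

    def __call__(self, x):
        return np.tanh(x @ self.enc) @ self.dec


class Stack:
    """Blocks of one capacity joined by an assumed intra-stack residual link."""

    def __init__(self, dim, bottleneck, n_blocks, rng):
        self.blocks = [Block(dim, bottleneck, rng) for _ in range(n_blocks)]

    def __call__(self, x):
        residual, recon = x, np.zeros_like(x)
        for block in self.blocks:
            out = block(residual)      # model what is still unexplained
            recon = recon + out        # accumulate the reconstruction
            residual = residual - out  # pass the remainder onward
        return recon, residual


class Stream:
    """Stacks ordered by capacity joined by an assumed inter-stack residual link."""

    def __init__(self, dim, bottlenecks, n_blocks, rng):
        self.stacks = [Stack(dim, b, n_blocks, rng) for b in bottlenecks]

    def __call__(self, x):
        residual, recon = x, np.zeros_like(x)
        for stack in self.stacks:
            out, residual = stack(residual)
            recon = recon + out
        return recon, residual


def anomaly_scores(frames, spatial, temporal):
    """Score frames 1..T-1 by summed reconstruction error of both streams."""
    diffs = frames[1:] - frames[:-1]  # frame difference as the temporal input
    s_rec, _ = spatial(frames[1:])
    t_rec, _ = temporal(diffs)
    s_err = np.mean((frames[1:] - s_rec) ** 2, axis=1)
    t_err = np.mean((diffs - t_rec) ** 2, axis=1)
    return s_err + t_err  # equal weighting is an assumption


rng = np.random.default_rng(0)
frames = rng.random((5, 32))  # 5 flattened toy frames, 32 values each

# "Expanding or contracting" here means changing these configuration lists.
small = Stream(dim=32, bottlenecks=[4], n_blocks=1, rng=rng)
large = Stream(dim=32, bottlenecks=[4, 8, 16], n_blocks=2, rng=rng)

recon, residual = large(frames)
scores = anomaly_scores(frames, spatial=large, temporal=small)
```

In this sketch the configuration is just the `bottlenecks` list and `n_blocks`: a shorter list gives a smaller model with less representation capacity, a longer one a larger model. With trained weights, a higher score would indicate a frame that the learned normality reconstructs poorly; here the weights are random, so the scores only demonstrate the data flow.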