Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without any training data from the target domain, a crucial task given practical concerns such as data privacy and new surveillance deployments. Skeleton-based approaches have an inherent generalization advantage for ZS-VAD because they eliminate domain disparities in both background and human appearance. However, existing methods learn only low-level skeleton representations and rely on a domain-limited normality boundary, so they do not generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework that unlocks the potential of skeleton data via action typicality and uniqueness learning. First, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into an action semantic space and, during training, distills an LLM's knowledge of typical normal and abnormal behaviors. Second, we propose a test-time context uniqueness analysis module that finely analyzes the spatio-temporal differences between skeleton snippets and then derives scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results among skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, which together feature over 100 unseen surveillance scenes.
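The abstract does not specify how context uniqueness is computed or how it is merged with the typicality signal. As a rough illustration only, the sketch below assumes each skeleton snippet has already been encoded as a feature vector, scores a snippet by its mean cosine distance to its nearest neighbours within the same test scene, and blends that with a precomputed typicality-based anomaly score. The function names, the k-nearest-neighbour rule, the min-max normalization, and the `weight` parameter are all hypothetical choices made for this example, not the authors' formulation.

```python
import numpy as np


def uniqueness_scores(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Hypothetical uniqueness score: mean cosine distance from each snippet
    to its k nearest neighbours among the snippets of the same scene.

    features: array of shape (num_snippets, dim), one row per snippet.
    """
    n = len(features)
    if n < 2:
        # A single snippet has no context to be compared against.
        return np.zeros(n)
    # L2-normalize rows so the dot product equals cosine similarity.
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - normed @ normed.T
    # Exclude each snippet's distance to itself.
    np.fill_diagonal(dist, np.inf)
    k = min(k, n - 1)
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def scene_adaptive_scores(
    typicality: np.ndarray, uniqueness: np.ndarray, weight: float = 0.5
) -> np.ndarray:
    """Blend two per-snippet scores after rescaling each to [0, 1] within
    the scene, so the resulting scale adapts to the scene's own statistics.

    typicality: assumed to be a precomputed anomaly score per snippet
    (higher means more similar to typical abnormal behavior).
    """

    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)

    return weight * minmax(typicality) + (1.0 - weight) * minmax(uniqueness)
```

Normalizing within each scene is one simple way to obtain a boundary that shifts with the scene rather than being fixed at training time; a snippet that resembles many others in its scene receives a low uniqueness score even if its raw features would look unusual in another domain.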