Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) current models process all anomaly categories with a single shared model, which overlooks category-specific features and cannot account for the diversity of anomaly types; and (2) the weak supervision signal lacks precise temporal information, which limits the ability to capture subtle anomalous patterns that are blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing a specific anomaly type. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on the temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with an AUC of 91.58% on the UCF-Crime dataset, and demonstrates superior results on the XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
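The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the Gaussian targets are built or how the experts are gated, so everything below is an assumption made for illustration: the function names `gaussian_pseudo_labels` and `mix_experts`, the peak-picking rule, the `sigma` and `threshold` parameters, and the softmax gate are all hypothetical and are not taken from GS-MoE itself. The first function turns coarse per-snippet anomaly scores into a smooth temporal target by placing one Gaussian bump at each local score peak; the second combines per-expert frame scores with softmax gate weights.

```python
import math


def gaussian_pseudo_labels(scores, sigma=2.0, threshold=0.5):
    """Build a soft temporal target from per-snippet anomaly scores.

    Assumption for illustration: a peak is a local maximum whose score is
    at least `threshold`; each peak contributes one Gaussian bump and
    overlapping bumps are combined with max.
    """
    n = len(scores)
    peaks = [
        t for t in range(n)
        if scores[t] >= threshold
        and (t == 0 or scores[t] >= scores[t - 1])
        and (t == n - 1 or scores[t] > scores[t + 1])
    ]
    target = [0.0] * n
    for p in peaks:
        for t in range(n):
            bump = math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2))
            target[t] = max(target[t], bump)
    return target


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_scores, gate_logits):
    """Combine per-expert frame scores with softmax gate weights.

    expert_scores: one list of frame scores per (hypothetical) class expert.
    gate_logits:   one logit per expert; a learned gate would produce these.
    """
    weights = softmax(gate_logits)
    n = len(expert_scores[0])
    return [
        sum(w * s[t] for w, s in zip(weights, expert_scores))
        for t in range(n)
    ]


if __name__ == "__main__":
    coarse = [0.1, 0.2, 0.9, 0.3, 0.1]
    print(gaussian_pseudo_labels(coarse))
    print(mix_experts([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]))
```

In a training loop, a target such as the one returned by `gaussian_pseudo_labels` could serve as a soft regression target for frame-level scores in abnormal videos, which is one plausible way to give a video-level label some temporal shape. Whether GS-MoE does this is not stated in the abstract.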