Weakly supervised video anomaly detection (WS-VAD) is a challenging problem that aims to learn VAD models using only video-level annotations. In this work, we propose a Long-Short Temporal Co-teaching (LSTC) method to address the WS-VAD problem. LSTC constructs two tubelet-based spatio-temporal transformer networks that learn from short- and long-term video clips, respectively. Each network is trained with a multiple instance learning (MIL)-based ranking loss, together with a cross-entropy loss when clip-level pseudo labels are available. The two networks are trained with a co-teaching strategy: the clip-level pseudo labels generated by each network are used to supervise the other network in the next training round, and the two networks are trained alternately and iteratively. Our method is better able to handle anomalies of varying durations as well as subtle anomalies. Extensive experiments on three public datasets demonstrate that our method outperforms state-of-the-art WS-VAD methods.
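As a rough illustration of the training objective described above, the following sketch combines an MIL-based ranking loss with a cross-entropy term that is applied only when clip-level pseudo labels are available. The abstract does not give the exact loss formulation, so the hinge form with a margin of 1, the `ce_weight` factor, the 0.5 threshold for pseudo labels, and all function names here are assumptions chosen for illustration, not the authors' implementation.

```python
import math
from typing import Optional, Sequence


def mil_ranking_loss(abnormal_scores: Sequence[float],
                     normal_scores: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss (assumed form): the top-scoring clip of an
    abnormal video should outscore the top-scoring clip of a normal one."""
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))


def clip_cross_entropy(scores: Sequence[float],
                       pseudo_labels: Sequence[int],
                       eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between clip scores in [0, 1] and 0/1 labels."""
    total = 0.0
    for s, y in zip(scores, pseudo_labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)


def total_loss(abnormal_scores: Sequence[float],
               normal_scores: Sequence[float],
               pseudo_labels: Optional[Sequence[int]] = None,
               ce_weight: float = 1.0) -> float:
    """Ranking loss, plus cross-entropy only when pseudo labels exist."""
    loss = mil_ranking_loss(abnormal_scores, normal_scores)
    if pseudo_labels is not None:
        loss += ce_weight * clip_cross_entropy(abnormal_scores, pseudo_labels)
    return loss


def make_pseudo_labels(scores: Sequence[float],
                       threshold: float = 0.5) -> list:
    """Hypothetical pseudo-labeling rule: threshold the peer network's scores."""
    return [1 if s >= threshold else 0 for s in scores]
```

In a co-teaching round under these assumptions, the scores produced by one network would be passed through `make_pseudo_labels`, and the resulting labels supplied as `pseudo_labels` when computing `total_loss` for the other network in the next round.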