Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long surveillance videos. Methods based on anomaly scoring have prevailed for years, but they require complex threshold selection and yield detection results with low explainability. In this paper, we conduct pioneering research on incorporating video-based large language models (VLLMs) into the VAD framework, making the VAD model free from thresholds and able to explain the reasons for the anomalies it detects. We introduce a novel network module, Long-Term Context (LTC), to mitigate the limitations of VLLMs in long-range context modeling. We design a three-phase training method that improves the efficiency of fine-tuning VLLMs by substantially reducing the amount of VAD data required and lowering the cost of annotating instruction-tuning data. Our trained model achieves the best performance on the anomaly videos of the UCF-Crime and TAD benchmarks, with AUC improvements of +3.86\% and +4.96\%, respectively. Moreover, our approach can provide textual explanations for the detected anomalies.
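To make the contrast with score-based methods concrete, the following minimal sketch shows the conventional pipeline that the abstract argues against. It is not the proposed model. It assumes per-frame anomaly scores in [0, 1] and binary frame-level labels, and it assumes that the reported AUC is computed at the frame level, as is common on UCF-Crime. The function names `segments_from_scores` and `frame_level_auc` and the toy data are hypothetical.

```python
def segments_from_scores(scores, threshold):
    """Turn per-frame anomaly scores into (start, end) frame segments.

    Illustrates the thresholding step of score-based VAD: the detected
    segments depend directly on the chosen threshold.
    """
    segments = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a new anomalous segment begins
        elif s < threshold and start is not None:
            segments.append((start, i - 1))  # segment ended at previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))  # segment runs to the end
    return segments


def frame_level_auc(scores, labels):
    """Threshold-free ROC AUC via pairwise comparison (Mann-Whitney form).

    Returns the fraction of (anomalous, normal) frame pairs in which the
    anomalous frame has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one frame of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): seven frames, frames 2-4 are anomalous.
scores = [0.1, 0.2, 0.8, 0.9, 0.4, 0.7, 0.1]
labels = [0, 0, 1, 1, 1, 0, 0]

print(segments_from_scores(scores, 0.5))  # segments at a strict threshold
print(segments_from_scores(scores, 0.3))  # segments at a looser threshold
print(frame_level_auc(scores, labels))
```

On this toy input the two thresholds produce different segmentations of the same scores, which is the sensitivity the abstract refers to, while the AUC summarizes ranking quality without committing to any single threshold.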