Existing methods for open-ended Video Anomaly Detection (VAD) often produce biased detections when faced with challenging or unseen events, and they lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. First, toward an unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm: efficient single-frame annotations are applied to the collected untrimmed videos, and high-quality analyses of both abnormal and normal video clips are then synthesized using a robust off-the-shelf video captioner and a large language model (LLM). Building upon VAD-Instruct50k, we develop a customized solution for interpretable video anomaly detection: we train a lightweight temporal sampler to select frames with high anomaly response, and we fine-tune a multimodal large language model (MLLM) to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at https://holmesvad.github.io.
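The frame-selection idea mentioned in the abstract (keeping frames with high anomaly response) can be illustrated with a minimal sketch. This is not the Holmes-VAD implementation: the function name `select_anomalous_frames`, the `threshold` and `max_frames` parameters, and the assumption that per-frame anomaly scores in [0, 1] are already available are all hypothetical choices made for illustration.

```python
from typing import List


def select_anomalous_frames(
    scores: List[float],
    threshold: float = 0.8,   # hypothetical cutoff, not taken from the paper
    max_frames: int = 8,      # hypothetical frame budget for the downstream model
) -> List[int]:
    """Return indices of high-scoring frames, in temporal order.

    `scores` is assumed to hold one anomaly score per frame, as a
    temporal sampler might produce.
    """
    # Keep only frames whose score reaches the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    # If there are too many, prefer the highest-scoring ones.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = candidates[:max_frames]
    # Restore temporal order before passing frames on.
    return sorted(kept)


if __name__ == "__main__":
    example_scores = [0.1, 0.9, 0.85, 0.2, 0.95]
    print(select_anomalous_frames(example_scores, threshold=0.8, max_frames=2))
```

In this sketch the selected indices would then be used to pick the frames passed to the multimodal model for explanation; how the actual system samples frames when no score exceeds the threshold is not stated in the abstract and is left out here.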