How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction and often lack the interpretability needed for complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding at any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotation by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This yields over 70,000 multi-granular annotations organized into clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The visual-language model integrated with ATS outperforms traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
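To make the density-aware sampling idea behind ATS concrete, the sketch below samples frame indices with density proportional to per-frame anomaly scores via inverse-transform sampling over the score CDF. This is a minimal illustration under our own assumptions (the function name, the uniform floor, and the inverse-CDF scheme are hypothetical), not the actual HolmesVAU implementation.

```python
import numpy as np

def density_aware_sample(scores: np.ndarray, num_frames: int) -> np.ndarray:
    """Hypothetical sketch of density-aware frame sampling: regions with
    higher anomaly scores receive proportionally more sampled frames."""
    # Add a small uniform floor so normal regions still get sparse coverage.
    density = scores.astype(np.float64) + 1e-3
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    # Invert the CDF at evenly spaced quantiles (inverse-transform sampling);
    # dense score regions occupy more of the CDF, so they attract more samples.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    indices = np.searchsorted(cdf, quantiles)
    return np.unique(np.clip(indices, 0, len(scores) - 1))
```

With a score spike in frames 40-59 of a 100-frame video, nearly all of a 16-frame budget lands inside the spike, while uniform scores reduce to near-uniform sampling.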