Recent progress in the reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset of 8,641 videos, each annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples and making it one of the largest datasets for video anomaly understanding. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building on the proposed task and dataset, we develop an end-to-end MLLM-based VAR model, termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.