Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both the textual anchors and the prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations without instruction tuning, frame-level annotations, external modules, or dense processing, making it an efficient and practical solution for real-world applications.
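The coarse scoring stage described above can be sketched as a similarity comparison between frame embeddings and textual anchor embeddings. The snippet below is a minimal, hypothetical illustration only, assuming L2-normalized embeddings from a CLIP-style vision-language encoder and a softmax over the best-matching normal and anomalous anchors; the function and variable names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def anomaly_scores(frame_embs, normal_anchors, anomaly_anchors):
    """Score each frame by its similarity to anomalous vs. normal text anchors.

    All inputs are assumed to be L2-normalized embeddings (rows), e.g. from a
    CLIP-style encoder, so the dot product equals cosine similarity.
    """
    def max_sim(frames, anchors):
        # Cosine similarity to every anchor; keep the best-matching anchor per frame.
        return (frames @ anchors.T).max(axis=1)

    s_anom = max_sim(frame_embs, anomaly_anchors)
    s_norm = max_sim(frame_embs, normal_anchors)
    # Two-way softmax turns the similarity pair into an anomaly score in (0, 1).
    return np.exp(s_anom) / (np.exp(s_anom) + np.exp(s_norm))

# Toy usage: random unit vectors stand in for real frame/anchor embeddings.
rng = np.random.default_rng(0)
def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

frames = unit(rng.normal(size=(4, 512)))          # 4 frames
normal = unit(rng.normal(size=(3, 512)))          # 3 "normal" anchors
anomal = unit(rng.normal(size=(3, 512)))          # 3 "anomalous" anchors
scores = anomaly_scores(frames, normal, anomal)   # one score per frame
```

In an actual pipeline the anchors would be the APE-optimized anchor texts encoded once offline, so per-frame scoring reduces to a single matrix product, which is what makes this stage cheap enough for real-time use.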