The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehensible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on additional training with instruction-tuning datasets to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data-annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without modifying model parameters. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions that capture distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both the detection performance and the explainability of VLMs for VAD.
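The refinement of segment-level scores into frame-level scores can be illustrated with a minimal sketch. The function name, the uniform frames-per-segment expansion, and the Gaussian-weighted temporal window below are illustrative assumptions standing in for VERA's scene- and temporal-context fusion, not the paper's exact procedure.

```python
import math

def refine_to_frame_scores(segment_scores, frames_per_segment, window=5, sigma=1.0):
    """Expand segment-level anomaly scores to per-frame scores, then apply
    Gaussian-weighted temporal smoothing (a simplified stand-in for the
    temporal-context fusion described in the abstract)."""
    # Assign each frame the score of its parent segment.
    frames = [s for s in segment_scores for _ in range(frames_per_segment)]
    half = window // 2
    # Gaussian weights over a symmetric temporal window.
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    smoothed = []
    for t in range(len(frames)):
        num = den = 0.0
        for offset, w in zip(range(-half, half + 1), weights):
            if 0 <= t + offset < len(frames):
                num += w * frames[t + offset]
                den += w
        # Normalize by the in-bounds weight mass so boundary frames stay unbiased.
        smoothed.append(num / den)
    return smoothed
```

For example, `refine_to_frame_scores([0.1, 0.9, 0.2], frames_per_segment=4)` yields twelve frame-level scores whose values stay within the range of the segment scores, with smooth transitions across segment boundaries rather than hard jumps.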