Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter prioritizes language quality over factual relevance, often yielding subjective judgments misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained, and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding the key descriptive elements of anomalies in video: events (What), participating entities (Who), and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information. Human evaluation reveals that our proposed metric aligns with human perception of anomalies more closely than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and events with strong visual cues.
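To make the element-coverage idea behind an FVScore-style metric concrete, the toy sketch below (not the paper's actual implementation; the dimension names, annotation format, and naive substring matching are all illustrative assumptions) checks which annotated key visual elements along the What/Who/Where axes appear in a model's free-form answer, and reports per-dimension and overall coverage:

```python
# Toy, FVScore-inspired coverage sketch (hypothetical; the real metric
# would use semantic rather than substring matching, and its annotation
# schema may differ).

def element_covered(answer: str, element: str) -> bool:
    """Naive case-insensitive containment check for one key element."""
    return element.lower() in answer.lower()

def toy_fvscore(answer: str, key_elements: dict) -> dict:
    """Fraction of annotated elements found per dimension, plus the mean."""
    scores = {}
    for dim, elements in key_elements.items():  # e.g. "what", "who", "where"
        hits = sum(element_covered(answer, e) for e in elements)
        scores[dim] = hits / len(elements) if elements else 1.0
    scores["overall"] = sum(scores[d] for d in key_elements) / len(key_elements)
    return scores

# Hypothetical ground-truth annotation for one anomalous clip:
gt = {
    "what": ["shoplifting", "conceals item"],
    "who": ["man in red jacket"],
    "where": ["convenience store"],
}
answer = "A man in red jacket conceals item at a convenience store."
scores = toy_fvscore(answer, gt)
```

Because each dimension is scored separately, the output is interpretable: a model can be credited for localizing the scene (Where) while being penalized for missing the event label (What), which n-gram overlap metrics cannot distinguish.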