Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, Xiaofeng Tao

Source: arXiv. Accepted at CVPR 2024. Codebase: https://github.com/fesvhtr/CUVA

Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks concentrate primarily on anomaly detection and localization, we take a more practical focus, which prompts three crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark carries three sets of human annotations that indicate the "what", "why" and "how" of an anomaly: 1) the anomaly type, start and end times, and event descriptions, 2) natural language explanations of the cause of the anomaly, and 3) free text describing the effect of the abnormality. In addition, we introduce MMEval, a novel evaluation metric designed to align better with human preferences for CUVA, which makes it possible to measure how well existing LLMs comprehend the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline for the challenging CUVA benchmark. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at https://github.com/fesvhtr/CUVA.
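
To make the three annotation sets concrete, here is a minimal sketch of how one benchmark instance could be represented in Python. The class name, field names, and example values are all hypothetical, chosen only to mirror the "what", "why" and "how" groups described in the abstract. They are not the actual schema used in the CUVA repository.

```python
from dataclasses import dataclass


@dataclass
class AnomalyAnnotation:
    """Hypothetical container for one annotated anomaly (not the real CUVA schema)."""

    # "What": anomaly type, start and end times, and an event description.
    anomaly_type: str
    start_sec: float
    end_sec: float
    description: str
    # "Why": natural language explanation of the cause.
    cause: str
    # "How": free text describing the effect of the anomaly.
    effect: str

    def __post_init__(self) -> None:
        # Reject time spans that end before they start.
        if self.end_sec < self.start_sec:
            raise ValueError("end_sec must not precede start_sec")

    @property
    def duration_sec(self) -> float:
        """Length of the annotated anomaly segment in seconds."""
        return self.end_sec - self.start_sec


# Invented example values, for illustration only.
example = AnomalyAnnotation(
    anomaly_type="traffic accident",
    start_sec=12.0,
    end_sec=20.5,
    description="A car runs a red light and hits a truck in the intersection.",
    cause="The driver of the car ignored the traffic signal.",
    effect="Both vehicles are damaged and traffic is blocked.",
)
print(example.duration_sec)
```

Keeping the three groups as separate fields would let an evaluation script score the "why" and "how" free text independently of the temporal localization, which is the distinction the abstract draws against detection-only benchmarks.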
