Prior studies on Video Anomaly Detection (VAD) mainly focus on detecting whether each video frame is abnormal, and largely ignore structured video semantic information (i.e., what abnormal event happens, when, and where). With this in mind, we propose a new chat-paradigm task, \textbf{M}ulti-scene Video Abnormal Event Extraction and Localization (M-VAE), which aims to extract abnormal event quadruples (i.e., subject, event type, object, scene) and localize each such event. We argue that this new task faces two key challenges: global-local spatial modeling and global-local spatial balancing. To this end, we propose a Global-local Spatial-sensitive Large Language Model (LLM) named Sherlock, which acts like Sherlock Holmes tracking down criminal events, for the M-VAE task. Specifically, the model includes a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR) to address the two challenges, respectively. Extensive experiments on our M-VAE instruction dataset show significant advantages of Sherlock over several advanced Video-LLMs. These results support the importance of global-local spatial information for the M-VAE task and the effectiveness of Sherlock in capturing such information.