In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model coarse-to-fine physical-world information (e.g., human behavior, object interactions, and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.