Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of these models are black-box systems, which poses challenges for explainability during both training and inference. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule-based Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One branch, the implicit branch, uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch, the explicit branch, utilizes language-image alignment to perform fine-grained classification. For the language channel in the explicit branch, RuleVM uses the state-of-the-art YOLOWorld model to detect objects in video frames, and association rules identified through data mining methods serve as descriptions of the video. Leveraging this dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieves the best performance in both coarse-grained and fine-grained monitoring, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
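To make the explicit branch's rule-mining step concrete, the sketch below shows one plausible way association rules could be mined from per-frame object labels of the kind a YOLOWorld-style detector might emit. The label sets, the support/confidence thresholds, and the `mine_rules` helper are all illustrative assumptions, not the paper's actual implementation.

```python
from itertools import combinations

# Hypothetical per-frame object detections (assumed labels, for
# illustration only; the real pipeline would use detector output).
frames = [
    {"person", "knife", "crowd"},
    {"person", "knife"},
    {"person", "crowd"},
    {"person", "knife", "crowd"},
    {"person"},
]

def mine_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Naive association-rule mining: keep itemsets whose support meets
    min_support, then emit rules LHS -> RHS whose confidence
    (support(itemset) / support(LHS)) meets min_confidence."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    support = {}
    # Enumerate candidate itemsets up to size 3 (fine for small label sets).
    for k in range(1, 4):
        for itemset in combinations(items, k):
            s = sum(1 for t in transactions if set(itemset) <= t) / n
            if s >= min_support:
                support[frozenset(itemset)] = s
    rules = []
    for itemset, s in support.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / support[lhs] if lhs in support else 0.0
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

rules = mine_rules(frames)
for lhs, rhs, conf in rules:
    print(f"{sorted(lhs)} -> {sorted(rhs)} (conf={conf:.2f})")
```

On this toy data the miner surfaces rules like "knife implies person", i.e. human-readable co-occurrence patterns that can then serve as textual descriptions of a video segment for language-image alignment.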