Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
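The bottleneck-token fusion idea can be illustrated with a minimal sketch: each modality attends jointly over its own tokens plus a small set of shared bottleneck tokens, so cross-modal information must pass through the compressed bottleneck. This is a hypothetical illustration of the general attention-bottleneck pattern, not the paper's exact module; the class name, head count, and the averaging of the per-modality bottleneck updates are all assumptions.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer in the attention-bottleneck style (illustrative sketch):
    visual and text streams each attend over [own tokens; bottleneck tokens],
    and the bottleneck is the only channel through which modalities exchange
    information, which limits redundant cross-modal flow."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt, bottleneck):
        # Visual stream: queries attend over visual tokens + bottleneck.
        kv_v = torch.cat([vis, bottleneck], dim=1)
        vis_out, _ = self.attn_vis(vis, kv_v, kv_v)
        bn_from_vis, _ = self.attn_vis(bottleneck, kv_v, kv_v)

        # Text stream: queries attend over text tokens + bottleneck.
        kv_t = torch.cat([txt, bottleneck], dim=1)
        txt_out, _ = self.attn_txt(txt, kv_t, kv_t)
        bn_from_txt, _ = self.attn_txt(bottleneck, kv_t, kv_t)

        # Merge the per-modality bottleneck updates (simple average here;
        # an assumption, not necessarily the paper's merging rule).
        bottleneck_out = 0.5 * (bn_from_vis + bn_from_txt)
        return vis + vis_out, txt + txt_out, bottleneck_out

# Toy shapes: batch 2, 16 visual tokens, 8 text tokens, 4 bottleneck tokens.
B, Nv, Nt, Nb, D = 2, 16, 8, 4, 64
vis = torch.randn(B, Nv, D)
txt = torch.randn(B, Nt, D)
bn = torch.randn(B, Nb, D)

layer = BottleneckFusionLayer(D)
vis2, txt2, bn2 = layer(vis, txt, bn)
```

Because the bottleneck has far fewer tokens than either modality, each fusion layer forces a compressed exchange; stacking such layers at multiple scales would give the progressive integration the abstract describes.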