Weakly supervised audio-visual violence detection has gained considerable attention in recent years. The task is to identify violent segments within multimodal data using only video-level labels. Despite advances in this field, the Euclidean neural networks used in prior research struggle to capture highly discriminative representations because of the limitations of the Euclidean feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. Our framework comprises a detour fusion module for multimodal fusion, which alleviates modality inconsistency between audio and visual signals. Additionally, we contribute two branches of fully hyperbolic graph convolutional networks that mine feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework captures the semantic discrepancies between violent and normal events. Extensive experiments on the XD-Violence benchmark demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
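To make the idea of learning snippet embeddings in hyperbolic space concrete, the following is a minimal sketch of how a Euclidean feature vector can be lifted onto a hyperbolic manifold and compared there. It is not HyperVD's implementation: the abstract does not state which hyperbolic model is used, so the Lorentz (hyperboloid) model with curvature -1 is assumed here, and the function names `lorentz_inner`, `exp_map_origin`, and `lorentz_distance` are hypothetical.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v):
    """Lift a Euclidean vector v onto the hyperboloid (assumed curvature -1).

    v is treated as a tangent vector at the hyperboloid origin (1, 0, ..., 0).
    """
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        # The zero vector maps to the origin of the hyperboloid.
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]


def lorentz_distance(x, y):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 to guard against rounding slightly below the acosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


# Two toy "snippet features" (illustrative values, not real data).
snippet_a = exp_map_origin([0.3, 0.4])
snippet_b = exp_map_origin([0.6, 0.8])
distance_ab = lorentz_distance(snippet_a, snippet_b)
```

Every lifted point satisfies `lorentz_inner(x, x) == -1` up to rounding, which is the defining constraint of the hyperboloid. In a graph-based model, a distance of this kind could serve as the basis for a similarity between snippets; how HyperVD actually builds its graph branches is not specified in the abstract.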