Multimodal hate detection, which aims to identify harmful online content such as memes, is crucial for building a healthy internet environment. Previous work has made valuable progress in detecting explicit hateful remarks. However, most existing approaches neglect implicit harm, which is particularly challenging because explicit text markers and demographic visual cues are often distorted or missing. The cross-modal attention mechanisms these approaches rely on also suffer from the distributional modality gap and lack logical interpretability. To address these issues, we propose TOT, a topology-aware optimal transport framework for deciphering implicit harm in memes, which formulates cross-modal alignment as the problem of solving for optimal transport plans. Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities. The kernel embedding provides a non-linear transformation into a reproducing kernel Hilbert space (RKHS), which is important for eliminating the distributional modality gap. Moreover, we perceive topology information from the aligned representations to conduct bipartite graph path reasoning. New state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT in capturing implicit cross-modal alignment.
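To make the idea of casting cross-modal alignment as an optimal transport plan more concrete, the following is a minimal sketch rather than the TOT implementation. It assumes an entropy-regularized formulation solved with Sinkhorn iterations, a cosine-distance cost, uniform marginals, and random stand-in features. None of these choices is stated in the abstract, and the names `sinkhorn_plan`, `cosine_cost`, `text_tokens`, and `image_regions` are hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularized transport plan between histograms a and b.

    Assumption: Sinkhorn scaling is used here for illustration only;
    the abstract does not say which solver TOT relies on.
    """
    kernel = np.exp(-cost / eps)      # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)        # rescale to match column marginals
        u = a / (kernel @ v)          # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]


# Hypothetical stand-ins for text-token and image-region embeddings.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))
image_regions = rng.normal(size=(7, 16))

cost = cosine_cost(text_tokens, image_regions)
a = np.full(5, 1.0 / 5)               # uniform mass over text tokens
b = np.full(7, 1.0 / 7)               # uniform mass over image regions
plan = sinkhorn_plan(cost, a, b)

# Each row of the plan shows how one token's mass is spread over regions.
best_region_per_token = plan.argmax(axis=1)
```

Under these assumptions, each entry `plan[i, j]` can be read as a soft alignment weight between text token `i` and image region `j`, and the nonzero entries define the weighted edges of a bipartite graph between the two modalities.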