Multimodal hate detection, which aims to identify harmful online content such as memes, is crucial for building a healthy internet environment. Previous work has made instructive progress in detecting explicit hateful remarks. However, most of these approaches neglect implicit harm, which is particularly challenging to analyze because explicit text markers and demographic visual cues are often distorted or missing. The cross-modal attention mechanisms they rely on also suffer from the distributional modality gap and lack logical interpretability. To address these semantic gap issues, we propose TOT, a topology-aware optimal transport framework for deciphering implicit harm in memes, which formulates cross-modal alignment as the problem of solving for optimal transport plans. Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities. The kernel embedding provides a non-linear transformation into a reproducing kernel Hilbert space (RKHS), which is significant for eliminating the distributional modality gap. Moreover, we perceive topology information from the aligned representations to conduct bipartite graph path reasoning. New state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT in capturing implicit cross-modal alignment.
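To make the idea of casting cross-modal alignment as an optimal transport plan concrete, the following is a minimal sketch and not the TOT implementation. It assumes entropy-regularized transport solved with Sinkhorn iterations, a cosine cost, uniform marginals, and random stand-in features. The names `sinkhorn_plan`, `cosine_cost`, `text_tokens`, and `image_regions` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, a, b, reg=0.5, n_iters=200):
    """Entropy-regularized transport plan between marginals a and b.

    This is the standard Sinkhorn scaling scheme, used here as an
    assumption; the abstract does not say which solver TOT uses.
    """
    kernel = np.exp(-cost / reg)  # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match the column marginal b
        u = a / (kernel @ v)    # rescale to match the row marginal a
    return u[:, None] * kernel * v[None, :]


# Random stand-ins for encoder outputs (illustrative only, not real features).
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))    # 5 text tokens, 16-dim each
image_regions = rng.normal(size=(7, 16))  # 7 image regions, 16-dim each

# Uniform mass on every token and region (an assumption for this sketch).
a = np.full(5, 1.0 / 5)
b = np.full(7, 1.0 / 7)

cost = cosine_cost(text_tokens, image_regions)
plan = sinkhorn_plan(cost, a, b)

# plan[i, j] is the mass moved from text token i to image region j;
# the heaviest entry in each row gives a soft token-to-region alignment.
best_region_per_token = plan.argmax(axis=1)
print(plan.shape, best_region_per_token)
```

Because the plan is a non-negative matrix between two disjoint sets of nodes, it can also be read as a weighted bipartite graph between text tokens and image regions, which is one plausible starting point for the kind of bipartite path reasoning the abstract describes.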