The need for multimodal content moderation (CM) is growing rapidly as more and more social media content is multimodal in nature. Existing unimodal CM systems may fail to catch harmful content that spans modalities (e.g., memes or videos), which can lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), that targets both multimodal and unimodal CM tasks. Specifically, to address the asymmetry in semantics between vision and language, AM3 uses a novel asymmetric fusion architecture designed not only to fuse the knowledge common to both modalities but also to exploit the information unique to each. Previous works focus on mapping the two modalities into a similar feature space while overlooking the intrinsic difference between the information conveyed by the modalities jointly and by each modality alone (asymmetry in modalities). In contrast, we propose a novel cross-modality contrastive loss to learn the unique knowledge that appears only in the multimodal combination. This is critical because some harmful intent may be conveyed only through the intersection of both modalities. With extensive experiments, we show that AM3 outperforms all existing state-of-the-art methods on both multimodal and unimodal CM benchmarks.