The need for multimodal content moderation (CM) is growing rapidly as an increasing share of social media content is multimodal. Existing unimodal CM systems may fail to catch harmful content that spans modalities (e.g., memes or videos), which can lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), that targets both multimodal and unimodal CM tasks. Specifically, to address the asymmetry in semantics between vision and language, AM3 uses an asymmetric fusion architecture designed not only to fuse the knowledge common to both modalities but also to exploit the information unique to each. Previous works focus on fusing the two modalities while overlooking the intrinsic difference between the information conveyed by multimodal content and by each modality alone (asymmetry in modalities). In contrast, we propose a cross-modality contrastive loss to learn the knowledge that appears only in the multimodal combination. This is critical because some harmful intent may be conveyed only through the intersection of both modalities. In extensive experiments, we show that AM3 outperforms all existing state-of-the-art methods on both multimodal and unimodal CM benchmarks.
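The abstract does not give the form of the cross-modality contrastive loss, so the following is only an illustrative sketch of one possible reading: a margin-based term that penalizes a fused (multimodal) embedding for being too similar to either unimodal embedding, which would encourage the fused representation to carry information that neither modality carries alone. The function name `cross_modal_separation_loss`, the `margin` parameter, and the use of cosine similarity are all assumptions made for illustration; this is not AM3's actual formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cross_modal_separation_loss(fused, image, text, margin=0.2):
    """Hypothetical margin-based loss, NOT the loss defined in the paper.

    fused, image, text: arrays of shape (batch, dim) holding the multimodal,
    image-only, and text-only embeddings of the same examples.
    The loss is zero when the cosine similarity between the fused embedding
    and each unimodal embedding is at most `margin`, and grows linearly
    with any excess similarity.
    """
    f = l2_normalize(np.asarray(fused, dtype=np.float64))
    v = l2_normalize(np.asarray(image, dtype=np.float64))
    t = l2_normalize(np.asarray(text, dtype=np.float64))

    # Row-wise cosine similarity between fused and unimodal embeddings.
    sim_fv = np.sum(f * v, axis=-1)
    sim_ft = np.sum(f * t, axis=-1)

    # Hinge penalty on similarity above the margin, averaged over the batch.
    per_example = np.maximum(0.0, sim_fv - margin) + np.maximum(0.0, sim_ft - margin)
    return float(per_example.mean())
```

In a real system a term like this would be added to the classification objective and computed on learned encoder outputs; the choice of margin and the way the term is weighted against the main loss would need to be tuned empirically.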