The need for multimodal content moderation (CM) is growing rapidly as an increasing share of social media content is multimodal. Existing unimodal CM systems may fail to catch harmful content that crosses modalities (e.g., memes or videos), which can lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), that targets both multimodal and unimodal CM tasks. Specifically, to address the asymmetry in semantics between vision and language, AM3 uses an asymmetric fusion architecture designed not only to fuse the knowledge common to both modalities but also to exploit the information unique to each. Previous work focuses on fusing the two modalities while overlooking the intrinsic difference between the information conveyed by multimodal content and by each modality alone (asymmetry in modalities). In contrast, we propose a cross-modality contrastive loss to learn the knowledge that appears only when the modalities are combined. This is critical because some harmful intent may be conveyed only through the intersection of the two modalities. In extensive experiments, we show that AM3 outperforms all existing state-of-the-art methods on both multimodal and unimodal CM benchmarks.