Validating Multimedia Content Moderation Software via Semantic Fusion

The exponential growth of social media platforms, such as Facebook and TikTok, has revolutionized communication and content publication in human society. Users on these platforms can publish multimedia content that delivers information through a combination of text, audio, images, and video. Meanwhile, the ability to publish multimedia content has been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisements, and pornography. In response, content moderation software has been widely deployed on these platforms to detect and block toxic content. However, due to the complexity of content moderation models and the difficulty of understanding information across multiple modalities, existing content moderation software can fail to detect toxic content, which often leads to extremely negative impacts. We introduce Semantic Fusion, a general, effective methodology for validating multimedia content moderation software. Our key idea is to fuse two or more existing single-modal inputs (e.g., a textual sentence and an image) into a new input that combines the semantics of its ancestors in a novel manner and is toxic by construction. This fused input is then used to validate multimedia content moderation software. We realize Semantic Fusion as DUO, a practical content moderation software testing tool. In our evaluation, we employ DUO to test five commercial content moderation software products and two state-of-the-art models against three kinds of toxic content. The results show that DUO achieves an error finding rate (EFR) of up to 100% when testing moderation software. In addition, we leverage the test cases generated by DUO to retrain the two models we explored, which largely improves model robustness while maintaining accuracy on the original test set.
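
The fuse-then-validate loop described above can be illustrated with a minimal sketch. It is not DUO's implementation. Every name in it (`FusedInput`, `fuse`, `error_finding_rate`, `text_only_moderator`) is hypothetical. The fusion strategy shown, which moves a toxic sentence into the image modality and pairs it with a benign caption, is only one assumed way to combine single-modal seeds. The sketch also assumes that EFR is the fraction of toxic-by-construction test cases that the software under test fails to flag.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FusedInput:
    """Hypothetical multimodal test case built from single-modal seeds."""
    caption: str        # text modality
    image_text: str     # stand-in for text rendered inside the image
    toxic: bool = True  # fused inputs are toxic by construction


def fuse(toxic_sentence: str, benign_caption: str) -> FusedInput:
    """Assumed fusion strategy: carry the toxic semantics in the image
    modality and keep the accompanying caption benign."""
    return FusedInput(caption=benign_caption, image_text=toxic_sentence)


def error_finding_rate(cases: Iterable[FusedInput],
                       moderator: Callable[[FusedInput], bool]) -> float:
    """Assumed EFR definition: share of toxic test cases NOT flagged."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one test case is required")
    missed = sum(1 for case in cases if not moderator(case))
    return missed / len(cases)


def text_only_moderator(case: FusedInput) -> bool:
    """Toy system under test: inspects only the caption, never the image."""
    blocklist = {"scam"}
    return any(word in case.caption.lower() for word in blocklist)


if __name__ == "__main__":
    cases = [
        fuse("click this scam link", "a nice sunset"),     # missed: caption is benign
        fuse("scam offer inside", "beware of this scam"),  # flagged via caption
    ]
    print(error_finding_rate(cases, text_only_moderator))
```

The toy moderator misses the first case because it never looks at the image modality, which is the kind of cross-modal blind spot that fused inputs are meant to expose.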
