Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. The code and the pre-trained models are publicly available at https://github.com/miccunifi/ISSUES.