Text-embedded images pose a formidable challenge in machine learning because interpreting them requires multimodal understanding of the multiple aspects of expression they convey. While previous research in multimodal analysis has focused primarily on single aspects such as hate speech and its subclasses, this study broadens the focus to encompass multiple linguistic aspects: hate, targets of hate, stance, and humor. We introduce PrideMM, a novel dataset of 5,063 text-embedded images associated with the LGBTQ+ Pride movement, addressing a serious gap in existing resources. We conduct extensive experiments on PrideMM, using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose MemeCLIP, a novel framework for efficient downstream learning that preserves the knowledge of the pre-trained CLIP model. Our experiments show that MemeCLIP outperforms previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
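To make the abstract's central idea concrete, below is a minimal sketch of downstream learning over a frozen pre-trained CLIP backbone with lightweight residual adapters, in the spirit of what the abstract describes. This is not the authors' released MemeCLIP implementation (see the repository above for that); the adapter bottleneck size, residual ratio, fusion by concatenation, and class count are illustrative assumptions.

```python
# Sketch: frozen CLIP + trainable adapters/classifier for meme classification.
# Assumes the Hugging Face `transformers` CLIP interface; hyperparameters are
# illustrative, not the MemeCLIP paper's settings.

import torch
import torch.nn as nn
from transformers import CLIPModel


class Adapter(nn.Module):
    """Bottleneck MLP blended with the frozen CLIP feature via a residual ratio,
    so the pre-trained representation is preserved rather than overwritten."""

    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weighted mix of adapted and original features.
        return self.ratio * self.net(x) + (1 - self.ratio) * x


class FrozenCLIPClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.clip.parameters():  # freeze CLIP: only adapters/head train
            p.requires_grad_(False)
        dim = self.clip.config.projection_dim
        self.img_adapter = Adapter(dim)
        self.txt_adapter = Adapter(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, pixel_values, input_ids, attention_mask):
        with torch.no_grad():  # frozen encoders: no gradients flow into CLIP
            img = self.clip.get_image_features(pixel_values=pixel_values)
            txt = self.clip.get_text_features(
                input_ids=input_ids, attention_mask=attention_mask
            )
        img = self.img_adapter(img)
        txt = self.txt_adapter(txt)
        fused = torch.cat([img, txt], dim=-1)  # simple concatenation fusion
        return self.head(fused)
```

In this sketch only the adapters and the classification head receive gradients, so downstream training is cheap and the frozen CLIP features remain intact, which is the general trade-off the abstract refers to as efficient downstream learning while preserving pre-trained knowledge.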