Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While memes are often humorous, their use for trolling and cyberbullying has been rising. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works address explainability. Recent regulations, such as the "right to explanation" in the General Data Protection Regulation, have spurred research on interpretable models rather than a sole focus on performance. Motivated by this, we introduce {\em MultiBully-Ex}, the first benchmark dataset for multimodal explanation of code-mixed cyberbullying memes. In this dataset, both the visual and textual modalities are highlighted to explain why a given meme constitutes cyberbullying. We propose a Contrastive Language-Image Pretraining (CLIP) projection-based multimodal shared-private multitask approach for visual and textual explanation of a meme. Experimental results demonstrate that training with multimodal explanations consistently improves both the generation of textual justifications and the identification of the visual evidence supporting a decision.