The exploitation of social media to spread hate has increased tremendously over the years. Lately, multi-modal hateful content such as memes has gained relatively more traction than uni-modal content. Moreover, the implicit content that memes carry makes them fairly challenging for existing hateful meme detection systems to detect. In this paper, we present a use case study that analyzes the vulnerabilities of such systems to external adversarial attacks. We find that even very simple perturbations in uni-modal and multi-modal settings, performed by humans with little knowledge of the model, leave existing detection models highly vulnerable. Empirically, we observe a noticeable drop of as much as 10% in the macro-F1 score for certain attacks. As a remedy, we attempt to boost the model's robustness using contrastive learning as well as an adversarial training-based method, VILLA. Using an ensemble of these two approaches, on two of our high-resolution datasets, we are able to regain performance to a large extent for certain attacks. We believe that our work is a first step toward addressing this crucial problem in an adversarial setting and that it will inspire more such investigations in the future.