Hate speech detection is the task of identifying hateful content that denigrates an individual or a group based on religion, gender, sexual orientation, or other characteristics. Because platforms enforce different policies, different groups of people express hate in different ways. Furthermore, the lack of labeled data on some platforms makes it challenging to build hate speech detection models. To this end, we revisit whether we can learn a generalizable hate speech detection model for the cross-platform setting, where we train the model on data from one (source) platform and generalize it across multiple (target) platforms. Existing generalization models rely on linguistic cues or auxiliary information, which biases them towards certain tags or certain kinds of words (e.g., abusive words) on the source platform and thus makes them not applicable to the target platforms. Inspired by social and psychological theories, we explore whether there exist inherent causal cues that can be leveraged to learn generalizable representations for detecting hate speech across these distribution shifts. We propose a causality-guided framework, PEACE, that identifies and leverages two intrinsic causal cues omnipresent in hateful content: the overall sentiment and the aggression in the text. We conduct extensive experiments across multiple platforms (representing the distribution shift) to examine whether causal cues can help cross-platform generalization.