Counterspeech can be an effective method for battling hateful content on social media, and automated counterspeech generation can aid in this process. Generated counterspeech, however, is viable only when grounded in the context of topic, audience, and sensitivity, as these factors influence both its efficacy and its appropriateness. In this work, we propose a novel framework based on theories of discourse to study the inferential links that connect counterspeech to the hateful comment. Within this framework, we propose: i) a taxonomy of counterspeech derived from discourse frameworks, and ii) discourse-informed prompting strategies for generating contextually grounded counterspeech. To construct and validate this framework, we present a process for collecting an in-the-wild dataset of counterspeech from Reddit. Using this process, we manually annotate a dataset of 3.9k Reddit comment pairs for the presence of hate speech and counterspeech. The positive pairs are annotated for 10 classes in our proposed taxonomy. We also annotate these pairs with paraphrased counterparts that remove offensiveness and first-person references. We show that, using our dataset and framework, large language models can generate contextually grounded counterspeech informed by theories of discourse. According to our human evaluation, our approaches can act as a safeguard against critical failures of discourse-agnostic models.