Automatic counterspeech generation methods have been developed to assist efforts in combating hate speech. Existing research focuses on generating counterspeech with linguistic attributes such as being polite, informative, and intent-driven. However, the real impact of counterspeech in online environments is seldom considered. This study aims to develop methods for generating counterspeech constrained by conversation outcomes and to evaluate their effectiveness. We experiment with large language models (LLMs) to incorporate two desired conversation outcomes into the text generation process: low conversation incivility and non-hateful hater reentry. Specifically, we experiment with instruction prompts, LLM finetuning, and LLM reinforcement learning (RL). Evaluation results show that our methods effectively steer the generation of counterspeech toward the desired outcomes. Our analyses, however, reveal differences in generation quality and style across models.