Counterspeech, a response intended to mitigate online hate speech, is increasingly used as a non-censorial solution. Addressing hate speech effectively involves dispelling the stereotypes, prejudices, and biases that are often subtly implied in brief, single-sentence statements or abuse. These implicit expressions challenge language models, especially in seq2seq tasks, because models typically perform better with longer contexts. Our study introduces CoARL, a novel framework that enhances counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. CoARL's first two phases perform sequential multi-instruction tuning: the model first learns to understand the intents, reactions, and harms of offensive statements, and then learns task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and non-toxicity. CoARL outperforms existing baselines in intent-conditioned counterspeech generation, with an average improvement of 3 points in intent-conformity metrics and 4 points in argument-quality metrics. Extensive human evaluation supports CoARL's efficacy in generating better and more context-appropriate responses than existing systems, including prominent LLMs like ChatGPT.
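The "low-rank adapter weights" mentioned in the second phase can be illustrated with a small, generic sketch. It is not CoARL's implementation: it only assumes the standard low-rank adaptation formulation, in which a frozen weight matrix W is augmented by a trainable product B·A scaled by alpha / r. All names here (`lora_forward`, `lora_param_count`, `alpha`) are hypothetical and chosen for illustration.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Apply a frozen linear layer plus a low-rank update.

    Computes x @ W^T + (alpha / r) * (x @ A^T) @ B^T, where
    x is batch x d_in, W is d_out x d_in (frozen),
    A is r x d_in and B is d_out x r (the only trainable parts).
    """
    r = len(a)                      # adapter rank (assumed hyperparameter)
    scale = alpha / r
    base = matmul(x, transpose(w))  # frozen path, batch x d_out
    low = matmul(matmul(x, transpose(a)), transpose(b))  # adapter path
    return [[p + scale * q for p, q in zip(brow, lrow)]
            for brow, lrow in zip(base, low)]


def lora_param_count(d_in, d_out, r):
    """Number of trainable adapter parameters for one adapted matrix."""
    return r * (d_in + d_out)


# Tiny worked example with hand-picked values.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
a = [[1.0, 1.0]]               # rank-1 adapter, r = 1
b = [[1.0], [2.0]]
out = lora_forward(x, w, a, b, alpha=2.0)

# With B set to zero (a common initialisation), the adapter is a no-op.
b_zero = [[0.0], [0.0]]
out_init = lora_forward(x, w, a, b_zero, alpha=2.0)
```

The point of the design is that only A and B are trained for the downstream task, so a separate task-specific adapter can be learned on top of a model that has already been instruction-tuned, while the base weights stay fixed.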