AI-generated counterspeech offers a promising and scalable strategy to curb online toxicity through direct replies that promote civil discourse. However, current counterspeech is one-size-fits-all, lacking adaptation to the moderation context and the users involved. We propose and evaluate multiple strategies for generating tailored counterspeech that is adapted to the moderation context and personalized for the moderated user. We instruct an LLaMA2-13B model to generate counterspeech, experimenting with various configurations based on different contextual information and fine-tuning strategies. We identify the configurations that generate persuasive counterspeech through a combination of quantitative indicators and human evaluations collected via a pre-registered mixed-design crowdsourcing experiment. Results show that contextualized counterspeech can significantly outperform state-of-the-art generic counterspeech in adequacy and persuasiveness, without compromising other characteristics. Our findings also reveal a poor correlation between quantitative indicators and human evaluations, suggesting that these methods assess different aspects and highlighting the need for nuanced evaluation methodologies. The effectiveness of contextualized AI-generated counterspeech and the divergence between human and algorithmic evaluations underscore the importance of increased human-AI collaboration in content moderation.