In this study, we explore the use of Large Language Models (LLMs) to counteract hate speech. We conducted the first real-life A/B test assessing the effectiveness of LLM-generated counter-speech. During the experiment, we posted 753 automatically generated responses, aimed at reducing user engagement, under tweets containing hate speech toward Ukrainian refugees in Poland. Our results show that interventions with LLM-generated responses significantly decrease user engagement, reducing it by over 20% for original tweets with at least ten views. This paper outlines the design of our automatic moderation system, proposes a simple metric for measuring user engagement, and details the methodology for conducting such an experiment. We also discuss the ethical considerations and challenges of deploying generative AI for discourse moderation.