Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models

Online hate is an escalating problem that harms the lives of Internet users. It also changes rapidly as events evolve, producing new waves of online hate that pose a critical threat. Detecting and mitigating these new waves presents two key challenges: determining whether content is hateful demands complex, reasoning-based decision-making, and the limited availability of training samples hinders updating the detection model. To address these challenges, we present HATEGUARD, a novel framework for effectively moderating new waves of online hate. HATEGUARD employs a reasoning-based approach that applies the recently introduced chain-of-thought (CoT) prompting technique to harness the capabilities of large language models (LLMs). HATEGUARD further achieves prompt-based zero-shot detection by automatically generating and updating its detection prompts with the new derogatory terms and targets found in new-wave samples. To demonstrate the effectiveness of our approach, we compile a new dataset of tweets related to three recently witnessed new waves: the 2022 Russian invasion of Ukraine, the 2021 insurrection at the US Capitol, and the COVID-19 pandemic. Our studies reveal crucial longitudinal patterns in these new waves as the underlying events evolve, and a pressing need for techniques that can rapidly update existing moderation tools to counteract them. Comparative evaluations against state-of-the-art tools illustrate the superiority of our framework, showing a substantial 22.22% to 83.33% improvement in detecting the three new waves of online hate. Our work highlights the severe threat posed by the emergence of new waves of online hate and represents a paradigm shift in addressing this threat practically.
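The prompt-updating idea described in the abstract can be illustrated with a minimal sketch. This is not the HATEGUARD implementation: the abstract does not give the prompt wording or the update procedure, so the names `PromptState`, `merge_unique`, and `build_cot_prompt`, the step-by-step questions, and the placeholder vocabulary are all assumptions made for illustration. The call to an LLM is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


def merge_unique(existing: List[str], new_items: List[str]) -> List[str]:
    """Append items not already present (case-insensitive), keeping order."""
    seen = {item.lower() for item in existing}
    merged = list(existing)
    for item in new_items:
        if item.lower() not in seen:
            seen.add(item.lower())
            merged.append(item)
    return merged


@dataclass
class PromptState:
    """Hypothetical container for the vocabulary a detection prompt draws on."""
    derogatory_terms: List[str] = field(default_factory=list)
    targets: List[str] = field(default_factory=list)

    def update(self, new_terms: List[str], new_targets: List[str]) -> None:
        """Fold terms and targets seen in new-wave samples into the state."""
        self.derogatory_terms = merge_unique(self.derogatory_terms, new_terms)
        self.targets = merge_unique(self.targets, new_targets)


def build_cot_prompt(state: PromptState, post: str) -> str:
    """Assemble a zero-shot chain-of-thought prompt (assumed wording)."""
    terms = ", ".join(state.derogatory_terms) or "(none listed)"
    targets = ", ".join(state.targets) or "(none listed)"
    return (
        "You are assisting with content moderation.\n"
        f"Known derogatory terms: {terms}\n"
        f"Known targets: {targets}\n"
        f"Post: {post}\n"
        "Let's think step by step:\n"
        "1. Does the post refer to a listed target or another identity group?\n"
        "2. Does it use a listed derogatory term or similar language "
        "against that target?\n"
        "3. Based on steps 1 and 2, answer 'hateful' or 'not hateful'.\n"
    )


# Placeholder vocabulary; real terms would come from new-wave samples.
state = PromptState()
state.update(["<term_a>", "<TERM_A>", "<term_b>"], ["<group_x>"])
prompt = build_cot_prompt(state, "example post text")
```

Because the detection model itself is never retrained in this sketch, adapting to a new wave only requires calling `update` with the newly observed terms and targets and rebuilding the prompt, which is the property the abstract attributes to prompt-based zero-shot detection.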
