Recent research has explored using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite growing interest in this area, the effectiveness and potential limitations of these generated explanations remain poorly understood. A key concern is that LLM-generated explanations may lead users and content moderators to erroneous judgments about the nature of flagged content. For instance, an LLM-generated explanation might wrongly convince a content moderator that a benign piece of content is hateful. In light of this, we propose an analytical framework for examining hate speech explanations and report an extensive survey evaluating such explanations. Specifically, we prompted GPT-3 to generate explanations for both hateful and non-hateful content, and surveyed 2,400 unique respondents to evaluate the generated explanations. Our findings reveal that (1) human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness; (2) the persuasiveness of these explanations, however, varied depending on the prompting strategy employed; and (3) this persuasiveness may result in incorrect judgments about the hatefulness of the content. Our study underscores the need for caution in applying LLM-generated explanations for content moderation. Code and results are available at https://github.com/Social-AI-Studio/GPT3-HateEval.