Recent research has focused on using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite growing interest in this area, the effectiveness and potential limitations of these generated explanations remain poorly understood. A key concern is that LLM-generated explanations may lead both users and content moderators to misjudge the nature of flagged content. For instance, an LLM-generated explanation might wrongly convince a content moderator that a benign piece of content is hateful. In light of this, we propose an analytical framework for examining hate speech explanations and conduct an extensive survey to evaluate such explanations. Specifically, we prompted GPT-3 to generate explanations for both hateful and non-hateful content, and surveyed 2,400 unique respondents to evaluate the generated explanations. Our findings reveal that (1) human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness; (2) the persuasiveness of these explanations, however, varied depending on the prompting strategy employed; and (3) this persuasiveness may result in incorrect judgments about the hatefulness of the content. Our study underscores the need for caution in applying LLM-generated explanations for content moderation. Code and results are available at https://github.com/Social-AI-Studio/GPT3-HateEval.