Detecting stereotypes and biases in Large Language Models (LLMs) can improve fairness and reduce adverse impacts on individuals or groups when these models are deployed. However, most existing methods measure a model's preference for biased or stereotyped sentences drawn from datasets, an approach that lacks interpretability and cannot detect implicit biases and stereotypes that arise in real-world use. To address this gap, this paper introduces a four-stage framework that directly evaluates stereotypes and biases in LLM-generated content. The four stages are direct inquiry testing, serial or adapted story testing, implicit association testing, and unknown situation testing. The paper also proposes multi-dimensional evaluation metrics and explainable zero-shot prompts for automated evaluation. Using the education sector as a case study, we construct Edu-FairBench based on the four-stage framework; it comprises 12,632 open-ended questions covering nine sensitive factors and 26 educational scenarios. Experimental results reveal varying degrees of stereotypes and biases in the five LLMs evaluated on Edu-FairBench. Moreover, the results of the proposed automated evaluation method correlate highly with human annotations.