Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias Evaluation

Recent advancements in Large Language Models (LLMs) have significantly increased their presence in human-facing Artificial Intelligence (AI) applications. However, LLMs can reproduce and even exacerbate stereotypes present in their training data. This work introduces the Multi-Grain Stereotype (MGS) dataset, comprising 51,867 instances across gender, race, profession, religion, and stereotypical text, built by fusing multiple publicly available stereotype detection datasets. We explore different machine learning approaches to establish baselines for stereotype detection, and fine-tune several language models of various architectures and sizes, presenting a series of stereotype classifier models for English text trained on MGS. To understand whether our stereotype detectors capture relevant features (aligning with human common sense), we utilise a variety of explainable AI tools, including SHAP, LIME, and BertViz, and analyse a series of example cases, discussing the results. Finally, we develop a series of stereotype elicitation prompts and evaluate the presence of stereotypes in text generation tasks with popular LLMs, using one of our best-performing stereotype detectors. Our experiments yielded several key findings: i) Training stereotype detectors in a multi-dimension setting yields better results than training multiple single-dimension classifiers. ii) The integrated MGS dataset enhances both the in-dataset and cross-dataset generalisation ability of stereotype detectors compared to using the datasets separately. iii) Newer versions of GPT-family LLMs generate content containing fewer stereotypes.
