Bias and Fairness in Large Language Models: A Survey

Rapid advancements in large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation (metrics and datasets) and one for mitigation. Our first taxonomy, of metrics for bias evaluation, disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the level at which they operate in a model: embeddings, probabilities, or generated text. Our second taxonomy, of datasets for bias evaluation, categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly available datasets for improved access. Our third taxonomy, of techniques for bias mitigation, classifies methods by the stage at which they intervene (pre-processing, in-training, intra-processing, or post-processing), with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide to the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.
