Holistically measuring the societal biases of large language models is crucial for detecting and reducing ethical risks in highly capable AI models. In this work, we present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models, covering stereotypes and societal biases in 14 social dimensions related to Chinese culture and values. The curation process comprises four essential steps: bias identification via extensive literature review, ambiguous context generation, AI-assisted disambiguated context generation, and manual review \& recomposition. The test instances in the dataset are automatically derived from 3K+ high-quality templates that were manually authored under stringent quality control. The dataset exhibits wide coverage and high diversity. Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias: all 10 of the publicly available Chinese large language models we evaluated exhibit strong bias in certain categories. Additionally, we observe from our experiments that fine-tuned models can, to a certain extent, heed instructions and avoid generating certain types of morally harmful outputs, by way of "moral self-correction". Our dataset and results are publicly available at \href{https://github.com/YFHuangxxxx/CBBQ}{https://github.com/YFHuangxxxx/CBBQ}, offering debiasing research opportunities to a wider community.