Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and for scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope: they typically focus on a single content type (e.g., hate speech), cover a limited set of demographic axes, overlook biases that affect multiple demographics simultaneously, and analyze only a few techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this work, we conduct a comprehensive benchmark study on English texts to assess the ability of LLMs to detect demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task of identifying the targeted identities using a demographic-focused taxonomy. We then systematically evaluate models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and for biases targeting multiple demographics, underscoring the need for more effective and scalable detection frameworks.
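To make the multi-label framing concrete, the following is a minimal sketch of how predictions over demographic axes could be scored per axis and on texts that target more than one demographic. The axis names, the function names, and the toy data are illustrative assumptions; the abstract does not specify the study's taxonomy or its metrics.

```python
from collections import Counter

# Hypothetical demographic axes (an assumption for illustration only;
# the study's actual taxonomy is not given in the abstract).
AXES = ["gender", "race_ethnicity", "religion", "age"]


def per_axis_f1(gold, pred, axes=AXES):
    """Return {axis: F1}, where gold and pred are lists of sets of axes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for axis in axes:
            if axis in p and axis in g:
                tp[axis] += 1      # axis correctly detected
            elif axis in p:
                fp[axis] += 1      # axis predicted but not targeted
            elif axis in g:
                fn[axis] += 1      # targeted axis missed
    scores = {}
    for axis in axes:
        denom = 2 * tp[axis] + fp[axis] + fn[axis]
        scores[axis] = 2 * tp[axis] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, axes=AXES):
    """Unweighted mean of per-axis F1, so rare axes count as much as common ones."""
    scores = per_axis_f1(gold, pred, axes)
    return sum(scores.values()) / len(axes)


def exact_match_on_multi(gold, pred):
    """Exact-match accuracy restricted to texts targeting two or more axes."""
    pairs = [(g, p) for g, p in zip(gold, pred) if len(g) >= 2]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy data: each set lists the axes targeted by one text (empty set = no target).
gold = [{"gender"}, {"gender", "religion"}, set(), {"race_ethnicity"}]
pred = [{"gender"}, {"gender"}, set(), {"age"}]

print(per_axis_f1(gold, pred))
print(macro_f1(gold, pred))
print(exact_match_on_multi(gold, pred))
```

Reporting per-axis scores alongside a separate score for multi-target texts is one way to surface the kinds of gaps the abstract mentions, since a single aggregate number can hide weak axes or missed second targets.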