As generative large language models (LLMs) grow more performant and prevalent, we must develop sufficiently comprehensive tools to measure and improve their fairness. Different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes, meaning that testing LLMs on more datasets can potentially help us characterize their biases more fully and better ensure equal and equitable treatment of marginalized demographic groups. In this work, our focus is two-fold: (1) Benchmarking: we compare 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative LLMs. Of those 6 metrics, two, AdvPromptSet and HolisticBiasR, are novel datasets proposed in this paper. Comparing these benchmarks gives us insights into the bias and toxicity of the compared models. To better understand these results, we also explore the frequency of demographic terms in common LLM pre-training corpora and how this may relate to model biases. (2) Mitigation: we conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements. ROBBIE aims to provide insights for practitioners deploying a model, emphasizing the need not only to measure potential harms, but also to understand how they arise by characterizing the data, to mitigate harms once found, and to balance any trade-offs. We open-source our analysis code in hopes of encouraging broader measurements of bias in future LLMs.