Texts generated by large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. Prior work has proposed benchmarks for identifying these stereotypical associations and techniques for mitigating them. However, as recent research has pointed out, existing benchmarks lack a robust experimental setup, hindering the inference of meaningful conclusions from their evaluation metrics. In this paper, we introduce a list of desiderata for robustly measuring biases in generative language models. Building upon these design principles, we propose a benchmark called OCCUGENDER, with a bias-measuring procedure to investigate occupational gender bias. We then use this benchmark to test several state-of-the-art open-source LLMs, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. We further propose prompting techniques to mitigate these biases without requiring fine-tuning. Finally, we validate the effectiveness of our methods through experiments on the same set of models.