Gender bias in artificial intelligence has become an important issue, particularly in the context of language models used in communication-oriented applications. This study examines the extent to which Large Language Models (LLMs) exhibit gender bias in pronoun selection in occupational contexts. The analysis evaluates GPT-4, GPT-4o, PaLM 2 Text Bison, and Gemini 1.0 Pro on a self-generated dataset. The occupations considered range from male-dominated to female-dominated, and also include jobs with a relatively balanced gender distribution. Three sentence-processing methods were used to assess potential gender bias: masked tokens, unmasked sentences, and sentence completion. In addition, the LLMs were asked to suggest names of individuals in specific occupations, and the gender distribution of these names was examined. The results show a positive correlation between the models' pronoun choices and the gender distribution in U.S. labor force data: female pronouns were more often associated with female-dominated occupations, and male pronouns with male-dominated ones. Sentence completion showed the strongest correlation with the actual gender distribution, whereas name generation yielded a more balanced, 'politically correct' gender distribution, albeit with notable deviations in strongly male- or female-dominated occupations. Overall, the prompting method had a greater impact on the resulting gender distribution than the choice of model itself, highlighting the complexity of addressing gender bias in LLMs and underscoring the importance of prompt design in gender mapping.