Large language models (LLMs) are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an algorithm audit of race and gender biases in one commonly-used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of the 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases; women's resumes had occupations with less experience, while Asian and Hispanic resumes had immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, in particular when used in workplace contexts.
翻译:大语言模型(LLMs)正越来越多地被引入工作场所,旨在提升效率与公平性。然而,这些模型可能反映或加剧社会偏见与刻板印象的问题引发了关注。本研究探讨了LLMs对招聘实践的潜在影响。为此,我们借鉴传统线下简历审计的历史经验,对常用LLM——OpenAI的GPT-3.5——进行种族与性别偏见的算法审计。我们通过两项研究使用具有不同种族与性别内涵的姓名:简历评估(研究1)与简历生成(研究2)。在研究1中,我们要求GPT对32个不同姓名(针对2个性别与4个种族群体组合的每组各4个姓名)及两个匿名选项的简历进行评分,覆盖10个职业与3项评估任务(总体评分、面试意愿与可雇佣性)。我们发现该模型反映出某些基于刻板印象的偏见。在研究2中,我们提示GPT为虚构求职者生成简历(每个姓名各10份)。生成简历时,GPT揭示了潜在的偏见:女性简历涉及经验较少的职业,而亚裔及西班牙裔简历包含非母语英语与非美国教育及工作经历等移民标志。我们的发现为关于LLM偏见的日益丰富的文献做出了贡献,特别是在工作场所应用背景下的偏见问题。