Resume screening is perceived as a particularly suitable task for LLMs given their ability to analyze natural language; thus many entities rely on general-purpose LLMs without further adapting them to the task. While researchers have shown that some LLMs are biased in their selection rates across demographic groups, studies measuring the validity of LLM decisions remain limited. One difficulty in externally measuring validity is the lack of access to a large corpus of resumes for which a ground-truth ranking is known and which has not already been used for LLM training. In this work, we overcome this challenge by systematically constructing a large dataset of directly comparable resumes tailored to particular jobs, with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs, finding that many models are unable to consistently select the resumes describing more qualified candidates. Furthermore, we find that models do not reliably abstain when ranking equally qualified candidates, and that they select candidates from different demographic groups at different rates, occasionally prioritizing historically marginalized candidates. Our proposed framework provides a principled approach to auditing LLM resume screeners in the absence of ground truth, offering independent auditors and developers a crucial tool for ensuring the validity of these systems as they are deployed.