Large Language Models (LLMs) are increasingly deployed in resume screening pipelines. Although explicit PII (e.g., names) is commonly redacted, resumes typically retain subtle sociocultural markers (languages, co-curricular activities, volunteering, hobbies) that can act as demographic proxies. We introduce a generalisable stress-test framework for hiring fairness, instantiated in the Singapore context: 100 neutral, job-aligned resumes are augmented into 4,100 variants spanning four ethnicities and two genders, differing only in job-irrelevant markers. We evaluate 18 LLMs in two realistic settings: (i) Direct Comparison (1v1) and (ii) Score & Shortlist (top-scoring rate), each with and without rationale prompting. Even without explicit identifiers, models recover demographic attributes with high F1 and exhibit systematic disparities, favouring markers associated with Chinese and Caucasian males. Ablations show that language markers suffice for ethnicity inference, whereas gender inference relies on hobbies and activities. Furthermore, prompting for explanations tends to amplify bias. Our findings suggest that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.
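As a rough illustration of the two evaluation settings named above, here is a minimal sketch of how the resulting preference metrics might be computed. Everything in it is an assumption for illustration: `query_llm` is a hypothetical wrapper around the model under test, resume variants are plain strings, and the prompt wording, helper names, and data layout are invented here, not taken from the paper.

```python
# Minimal sketch of the two evaluation settings described in the abstract.
# Assumptions (illustrative, not the paper's code): `query_llm` is a
# hypothetical callable wrapping the model under test; resume variants
# are plain strings; the prompt wording below is invented.
from collections import Counter


def direct_comparison_rate(variants_a, variants_b, query_llm):
    """1v1 setting: fraction of matched head-to-head pairs where the
    model prefers the group-A variant over the group-B variant."""
    wins = 0
    for a, b in zip(variants_a, variants_b):
        prompt = (
            "Which candidate is the better fit for the role?\n"
            f"Candidate A:\n{a}\n\nCandidate B:\n{b}\n\nAnswer with A or B."
        )
        if query_llm(prompt).strip().upper().startswith("A"):
            wins += 1
    return wins / len(variants_a)


def top_scoring_rate(scored_resumes, group_of):
    """Score & Shortlist setting: for each base resume, find which
    demographic variant received the highest score, then report how
    often each group tops its variant set (ties broken arbitrarily)."""
    counts = Counter()
    for variants in scored_resumes:  # one list of (variant_id, score) per base resume
        best_id, _ = max(variants, key=lambda v: v[1])
        counts[group_of(best_id)] += 1
    total = len(scored_resumes)
    return {group: n / total for group, n in counts.items()}
```

Under a fair model, the direct-comparison rate would hover near 0.5 for every group pairing, and the top-scoring rate would be roughly uniform across groups; the systematic deviations reported above are departures from those baselines.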