Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles in which an applicant's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases: Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within these subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces and find limited evidence of generalization: changing the prompt format can alter the race representation. Our work suggests that mechanistic approaches may provide a promising avenue for improving the fairness of LLMs, but a universal race representation remains elusive.
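To make the subspace-averaging intervention concrete, the following is a minimal PyTorch sketch, assuming an orthonormal basis `race_basis` for a learned race subspace has already been obtained (e.g., via distributed alignment search) and that activations for the same profile paired with names of different races are batched together. The function name, tensor shapes, and intervention point are illustrative assumptions, not the paper's implementation.

```python
import torch


def average_race_subspace(hidden_states: torch.Tensor,
                          race_basis: torch.Tensor) -> torch.Tensor:
    """Replace each example's race-subspace component with the batch mean.

    hidden_states: (batch, d_model) activations at the intervened layer/token,
        where the batch holds the same applicant profile with names from
        different races.
    race_basis:    (d_model, k) orthonormal columns spanning the race subspace.
    """
    # Coordinates of each activation inside the race subspace: (batch, k)
    coords = hidden_states @ race_basis
    # Mean race-subspace coordinates across all race variants: (1, k)
    mean_coords = coords.mean(dim=0, keepdim=True)
    # Swap each example's race component for the averaged one, leaving the
    # component orthogonal to the race subspace untouched.
    return hidden_states - (coords - mean_coords) @ race_basis.T
```

In practice such a function would be applied through a forward hook on the chosen layer during decoding, so that downstream decision tokens are generated from the debiased activations.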