Dialect prejudice predicts AI decisions about people's character, employability, and criminality

Hundreds of millions of people now interact with language models, with uses ranging from serving as a writing aid to informing hiring decisions. Yet these language models are known to perpetuate systematic racial prejudices, making their judgments biased in problematic ways about groups like African Americans. While prior research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice: we extend research showing that Americans hold raciolinguistic stereotypes about speakers of African American English and find that language models have the same prejudice, exhibiting covert stereotypes that are more negative than any human stereotypes about African Americans ever experimentally recorded, although closest to the ones from before the civil rights movement. By contrast, the language models' overt stereotypes about African Americans are much more positive. We demonstrate that dialect prejudice has the potential for harmful consequences by asking language models to make hypothetical decisions about people, based only on how they speak. Language models are more likely to suggest that speakers of African American English be assigned less prestigious jobs, be convicted of crimes, and be sentenced to death. Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level. Our findings have far-reaching implications for the fair and safe employment of language technology.
