A Novel Method for Analysing Racial Bias: Collection of Person-Level References

Long-term exposure to biased content in literature or media can significantly influence people's perceptions of reality, leading to implicit biases that are difficult to detect and address (Gerbner 1998). In this study, we propose a novel method for analyzing differences in representation between two groups and use it to examine the representation of African Americans and White Americans in books from 1850 to 2000 with the Google Books dataset (Goldberg and Orwant 2013). By developing better tools to understand differences in representation, we aim to contribute to ongoing efforts to recognize and mitigate biases. To improve on the more common phrase-based methods for differentiating context, which rely on group terms such as men, women, white, and black (Tripodi et al. 2019; Lucy, Tadimeti, and Bamman 2022), we propose collecting a comprehensive list of historically significant figures and using their names to select relevant context. This approach offers a more accurate and nuanced way to detect implicit biases by reducing the risk of selection bias. We create group representations for each decade and analyze them in an aligned semantic space (Hamilton, Leskovec, and Jurafsky 2016). We further support our results by assessing the time-adjusted toxicity (Bassignana, Basile, and Patti 2018) of the context for each group and by identifying the semantic axes (Lucy, Tadimeti, and Bamman 2022) that exhibit the most significant differences between the groups across decades. We validate our method by showing that it accurately captures known sociopolitical changes. Our findings indicate that while the relative number of African American names mentioned in books has increased over time, the context surrounding them remains more toxic than that surrounding White American names.
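The name-based context selection described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-entry name lists stand in for the comprehensive list of historical figures, the `(year, text)` record shape is hypothetical and is not the Google Books dataset's actual format, and the function names and the three-word window are our own choices rather than the authors' implementation.

```python
import re
from collections import defaultdict

# Hypothetical, tiny name lists for illustration only; the study describes
# a comprehensive list of historically significant figures.
GROUP_NAMES = {
    "african_american": ["Frederick Douglass", "Harriet Tubman"],
    "white_american": ["Abraham Lincoln", "Mark Twain"],
}


def build_name_pattern(names):
    """Compile one regex that matches any of the full names as a whole phrase."""
    escaped = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")


def collect_contexts(records, group_names, window=3):
    """Collect the words around each name mention, keyed by (group, decade).

    records: iterable of (year, text) pairs (an assumed input shape).
    window: number of words kept on each side of the name (an assumed value).
    """
    patterns = {g: build_name_pattern(n) for g, n in group_names.items()}
    contexts = defaultdict(list)
    for year, text in records:
        decade = (year // 10) * 10
        for group, pattern in patterns.items():
            for match in pattern.finditer(text):
                left = text[:match.start()].split()[-window:]
                right = text[match.end():].split()[:window]
                contexts[(group, decade)].append(left + right)
    return dict(contexts)


# Toy records written for this example, not drawn from any dataset.
records = [
    (1863, "The speech by Abraham Lincoln was printed in every paper"),
    (1855, "Crowds gathered to hear Frederick Douglass speak in Rochester"),
]
result = collect_contexts(records, GROUP_NAMES)
```

The per-decade word lists produced this way could then feed whatever representation is built for each group, for example an average of word embeddings, before any alignment across decades.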
