Data-driven analyses of biases in historical texts can help illuminate the origin and development of biases prevailing in modern society. However, digitised historical documents pose a challenge for NLP practitioners, as these corpora suffer from errors introduced by optical character recognition (OCR) and are written in archaic language. In this paper, we investigate the continuities and transformations of bias in historical newspapers published in the Caribbean during the colonial era (18th to 19th centuries). Our analyses are performed along the axes of gender, race, and their intersection. We examine these biases by conducting a temporal study in which we measure the development of lexical associations using distributional semantics models and word embeddings. Further, we evaluate the effectiveness of techniques designed to process OCR-generated data and assess their stability when trained on and applied to the noisy historical newspapers. We find that there is a trade-off between the stability of the word embeddings and their compatibility with the historical dataset. We provide evidence that gender and racial biases are interdependent, and that their intersection triggers distinct effects. These findings align with the theory of intersectionality, which stresses that biases affecting people with multiple marginalised identities compound to more than the sum of their constituents.