Current methods for identifying and classifying racist language in text rely either on small-n qualitative approaches or on large-n approaches that focus exclusively on overt forms of racist discourse. This article provides a generalizable, step-by-step guideline for identifying and classifying different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with a cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.