Current methods to identify and classify racist language in text rely on small-n qualitative approaches or large-n approaches focusing exclusively on overt forms of racist discourse. This article provides a step-by-step generalizable guideline to identify and classify different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with a cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.