Gender bias in text corpora used in various natural language processing (NLP) contexts, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making bias analysis more challenging. Existing methods designed for English are inadequate for this task due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively analyze gender representation in Spanish corpora. By using LLMs to identify gendered nouns and pronouns and classify them according to whether they refer to human entities, our approach provides a nuanced analysis of gender bias. We empirically validate our method on four widely used benchmark datasets, uncovering significant gender disparities, with male-to-female ratios ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered languages and suggest its application in NLP, contributing to the development of more equitable language technologies.
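The quantification step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes the LLM classification stage has already been run and has produced a gender label for each gendered noun or pronoun found to refer to a human entity; the labels below are hypothetical toy data.

```python
from collections import Counter

def male_to_female_ratio(labels):
    """Compute the male-to-female ratio from LLM-assigned labels.

    labels: iterable of strings, each "male" or "female", one per
    human-referring gendered token identified in the corpus.
    """
    counts = Counter(labels)
    if counts["female"] == 0:
        raise ValueError("no female-labeled tokens; ratio undefined")
    return counts["male"] / counts["female"]

# Hypothetical labels for a toy corpus slice:
labels = ["male"] * 8 + ["female"] * 2
print(male_to_female_ratio(labels))  # 4.0, i.e. a 4:1 disparity
```

In practice the label list would come from prompting an LLM over each candidate token in context, so the ratio reflects only tokens whose referents are human, as the abstract specifies.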