Adaptive Data Collection for Latin-American Community-sourced Evaluation of Stereotypes (LACES)

The evaluation of societal biases in NLP models is critically hindered by a geo-cultural gap. This gap leaves regions such as Latin America severely underserved, making it impossible to adequately assess or mitigate the harmful regional stereotypes that language technologies perpetuate. This paper presents LACES, a stereotype association dataset covering 15 Latin American countries. The dataset contains 4,789 stereotype associations manually created and annotated by 83 participants, and it was developed through targeted community partnerships across Latin America. We also propose a novel adaptive data collection methodology that integrates the sourcing of new stereotype entries and the validation of existing entries within a single, unified workflow. This approach yields more unique stereotypes than previous static collection methods, making stereotype collection more efficient. We further support the quality of LACES by showing that debiasing methods are less effective on this dataset than on existing popular stereotype benchmarks.

