AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is, however, often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa, with its unique, historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting a lack of language competence in general multilingual models and a lack of Goan cultural competence in Indian language models. Queried in English, models with stronger Indian language coverage show higher bias for pan-Indian groups than for hyperlocal Goan groups. This suggests that the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
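
To make "near-chance" concrete, the sketch below shows one common way to score a minimal-pair benchmark: count how often a model assigns a higher score to the stereotypical sentence than to its counterpart. The abstract does not specify the metric or the scoring function the authors use, so the function name `stereotype_preference_rate`, the tie handling, and the example scores are all assumptions made for illustration. Under this formulation, 50% would indicate no systematic preference.

```python
from typing import Iterable, Tuple


def stereotype_preference_rate(scores: Iterable[Tuple[float, float]]) -> float:
    """Percentage of minimal pairs where the stereotypical sentence scores higher.

    Each item is (stereotypical_score, counterpart_score), for example a
    pseudo-log-likelihood from an encoder model. How those scores are
    obtained is outside this sketch (assumption: higher means preferred).
    Ties count as half a win, so an indifferent model lands on 50.0.
    """
    pairs = list(scores)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = 0.0
    for stereo, counterpart in pairs:
        if stereo > counterpart:
            wins += 1.0
        elif stereo == counterpart:
            wins += 0.5
    return 100.0 * wins / len(pairs)


# Hypothetical scores for four pairs (not taken from the paper).
example_scores = [(-10.0, -12.0), (-9.5, -9.0), (-8.0, -11.0), (-7.0, -7.0)]
rate = stereotype_preference_rate(example_scores)
```

A rate well above 50% would point to a preference for stereotypical sentences, while a rate close to 50% is hard to distinguish from a model that cannot process the language at all, which is the ambiguity the Konkani results raise.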
