How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language

Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases worsen the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets that are language-specific or offer multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches for gender-, religion-, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories, even for sentences with similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.
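As an illustration only, an audit of this kind can be sketched as a paired-template comparison: sentences that differ only in an identity term are scored, and the per-identity mean scores are compared. The abstract does not specify the audit procedure, so everything below is an assumption: the templates, the placeholder identity terms `group_a` and `group_b`, the `audit` and `max_gap` helpers, and `toy_score`, which stands in for a fine-tuned BSA model and is deliberately biased so that the gap is visible.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical templates; "{identity}" marks the only slot that changes.
TEMPLATES: List[str] = [
    "The {identity} person cooked dinner.",
    "My neighbor is a {identity} teacher.",
]

# Placeholder identity terms (assumed, not taken from the paper).
IDENTITIES: List[str] = ["group_a", "group_b"]


def audit(score: Callable[[str], float],
          templates: List[str],
          identities: List[str]) -> Dict[str, float]:
    """Return the mean sentiment score per identity term over all templates."""
    return {
        identity: mean(score(t.format(identity=identity)) for t in templates)
        for identity in identities
    }


def max_gap(means: Dict[str, float]) -> float:
    """Largest difference in mean score between any two identity terms."""
    return max(means.values()) - min(means.values())


def toy_score(sentence: str) -> float:
    """Stand-in for a real classifier; deliberately biased for the demo."""
    return 0.25 if "group_b" in sentence else 0.75


means = audit(toy_score, TEMPLATES, IDENTITIES)
print(means, max_gap(means))
```

With a real model, `toy_score` would be replaced by a function that returns the classifier's positive-sentiment probability for a sentence. A gap close to zero would suggest the model treats the paired sentences alike, while a large gap would flag the category for closer inspection.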

