Pretrained language models inherently exhibit various social biases, and their widespread usage prompts a crucial examination of their social impact across diverse linguistic contexts. Previous studies have provided numerous methods for intrinsic bias measurement, predominantly focused on high-resource languages. In this work, we aim to extend these investigations to Bangla, a low-resource language. Specifically, in this study, we (1) create a dataset for intrinsic gender bias measurement in Bangla, (2) discuss necessary adaptations to apply existing bias measurement methods to Bangla, and (3) examine the impact of context length variation on bias measurement, a factor overlooked in previous studies. Through our experiments, we demonstrate a clear dependency of bias metrics on context length, highlighting the need for nuanced considerations in Bangla bias analysis. We consider our work a stepping stone for bias measurement in Bangla and make all of our resources publicly available to support future research.