Large Language Models (LLMs), now used daily by millions, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists, but it predominantly adopts a Western-centric frame and attends comparatively little to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs from an Indian-centric frame through Indian-BhED, a first-of-its-kind dataset containing stereotypical and anti-stereotypical examples of caste and religious stereotypes in India. We find that the majority of LLMs tested have a strong propensity to output stereotypes in the Indian context, especially when compared to axes of bias traditionally studied in the Western context, such as gender and race. Notably, GPT-2, GPT-2 Large, and GPT-3.5 show a particularly high propensity to prefer stereotypical outputs, as a percentage of all sentences, along the axes of caste (63-79%) and religion (69-72%). Finally, we investigate potential causes of this harmful behaviour in LLMs and posit intervention techniques to reduce both stereotypical and anti-stereotypical biases. The findings of this work highlight the need to include more diverse voices when researching fairness in AI and evaluating LLMs.