Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs (GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo) in the Nepali cultural context. Using a Croissant-compliant dataset of more than 2,400 stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), that combines two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show that the models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740 to 0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings reveals that explicit agreement aligns strongly with stereotypical sentence agreement but is a weak, and often negative, predictor of implicit completion bias, indicating that generative bias is poorly captured by agreement metrics. Sensitivity analysis shows that increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows that implicit bias is strongest for race and sociocultural stereotypes, whereas explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.
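The two DMBA metrics described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pair fields, domain labels, and the placeholder inputs are assumptions, and a real evaluation would obtain the agreement scores and completion choices by querying each LLM.

```python
# Illustrative sketch of the Dual-Metric Bias Assessment (DMBA) idea:
# (1) explicit agreement bias = mean agreement with stereotypical statements,
# (2) implicit completion bias = fraction of completions preferring the
#     stereotypical continuation. All inputs below are hypothetical.
from dataclasses import dataclass


@dataclass
class SentencePair:
    stereotypical: str       # stereotypical sentence of the pair
    anti_stereotypical: str  # matched anti-stereotypical counterpart
    domain: str              # e.g. "gender", "race", "sociocultural"


def explicit_agreement_bias(agreement_scores: list[float]) -> float:
    """Mean agreement (0-1) with the stereotypical member of each pair.

    Values near 1 indicate strong explicit bias; the scores would come
    from asking the model whether it agrees with each biased statement.
    """
    return sum(agreement_scores) / len(agreement_scores)


def implicit_completion_bias(choices: list[str]) -> float:
    """Fraction of completions where the model chose the stereotypical
    continuation over the anti-stereotypical one."""
    return sum(1 for c in choices if c == "stereotypical") / len(choices)


# Toy usage with fabricated placeholder values (not the paper's data):
pairs = [
    SentencePair("S1", "A1", "gender"),
    SentencePair("S2", "A2", "race"),
]
explicit = explicit_agreement_bias([0.40, 0.38])
implicit = implicit_completion_bias(["stereotypical", "anti_stereotypical"])
```

In this toy run the explicit metric is a simple mean (0.39) and the implicit metric is a preference rate (0.5); the paper reports these aggregated per model, per decoding configuration, and per domain.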