Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages, yet frameworks for quantifying these disparities remain scarce. We propose a two-phase framework for evaluating dialectal bias in LLM question answering across nine Bengali dialects. First, we use a retrieval-augmented generation (RAG) pipeline to translate standard Bengali questions into dialectal variants and gold-label them, yielding 4,000 question sets. Because traditional translation quality metrics fail on unstandardized dialects, we assess translation fidelity with an LLM-as-a-judge, which correlation with human judgments confirms outperforms legacy metrics. Second, we benchmark 19 LLMs on these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence: responses to the highly divergent Chittagong dialect score 5.44/10, compared with 7.68/10 for Tangail. Furthermore, increasing model scale does not consistently mitigate this bias. We contribute a validated method for evaluating translation quality, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.