This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs (GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B) on a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic bias in language models.
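The abstract does not specify how FAZE aggregates forced-choice answers into its 10-point score, so the following is only a minimal, hypothetical sketch of one way such a score could be computed: treat the model's picks across forced-choice prompts as a distribution over candidate regions and scale its distance from the uniform (unbiased) distribution to a 0–10 range. The function name `faze_bias_score` and the total-variation-distance formulation are assumptions for illustration, not the paper's actual method.

```python
from collections import Counter

def faze_bias_score(choices, regions, scale=10.0):
    """Hypothetical sketch of a FAZE-style regional bias score.

    choices: list of regions the model picked across forced-choice prompts.
    regions: the candidate regions offered in those prompts.
    Returns a score in [0, scale]; higher = stronger regional preference.
    """
    counts = Counter(choices)
    n = len(choices)
    k = len(regions)
    # Total variation distance between the observed choice distribution
    # and a uniform (unbiased) distribution over the candidate regions.
    tvd = 0.5 * sum(abs(counts.get(r, 0) / n - 1 / k) for r in regions)
    # TVD ranges from 0 (uniform) to 1 - 1/k (always the same region);
    # normalize so an always-one-region model scores exactly `scale`.
    return scale * tvd / (1 - 1 / k)

# Example: a model that picks "North America" in 9 of 10 prompts
picks = ["North America"] * 9 + ["Africa"]
print(round(faze_bias_score(picks, ["North America", "Africa", "Asia", "Europe"]), 1))  # → 8.7
```

Under this sketch, a model that splits its choices evenly across regions scores 0, and one that always picks the same region scores 10, matching the abstract's reading that higher scores indicate stronger regional favoritism.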