Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their cross-language consistency on politically sensitive topics remains understudied. This paper presents a systematic bilingual benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We document significant language bias: the phenomenon in which the same model produces substantively different political stances depending on the query language. Our findings reveal that 15 of the 17 tested models exhibit measurable language bias, with Chinese-origin models showing particularly severe issues, including outright refusal to answer and explicit propagation of Chinese Communist Party (CCP) narratives. Notably, only GPT-4o Mini achieves a perfect 10/10 score in both languages. We propose novel metrics for quantifying language bias and consistency, including the Language Bias Score (LBS) and Quality-Adjusted Consistency (QAC). Our benchmark and evaluation framework are open-sourced to enable reproducibility and community extension.