Geopolitical bias in language models is generally assumed to originate in the data used for pre-training. We tested seven open-weight LLM pairs from seven labs, each consisting of a base model (pre-training only) and its chat counterpart (pre-training plus post-training), using a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese. We found that geopolitical bias originates in post-training rather than in pre-training. For six of the seven labs, post-training shifted the model in the direction associated with the developer's country or region. The shift is strongest in Alibaba's Qwen 2.5: the base model is neutral on China-favourability (-0.15 log-odds, p=0.15), whereas the post-trained chat variant sits at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in bias toward other countries across all models. The magnitude of the shift further depends on the prompt language: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, which highlights the need for greater transparency, auditing, and oversight of the alignment processes that influence how models represent nations, cultures, and political perspectives.
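The log-odds figures quoted above can be turned into odds ratios with a short calculation. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the reported values are natural-log odds from the forced-choice probe. Under that assumption, the 18x figure appears to match exp(2.91), that is, the chat model compared against exact neutrality; comparing against the base model's point estimate of -0.15 instead would give a somewhat larger ratio.

```python
import math


def odds_ratio(log_odds_shift: float) -> float:
    """Convert a difference in natural-log odds into a multiplicative odds ratio."""
    return math.exp(log_odds_shift)


def probability(log_odds: float) -> float:
    """Convert log-odds into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Values quoted in the abstract (Qwen 2.5, China-favourability, log-odds).
BASE_LOG_ODDS = -0.15
CHAT_LOG_ODDS = 2.91
# FR-EN shift quoted for Mistral, also in log-odds.
MISTRAL_FR_EN_SHIFT = 1.91

# Reading 1 (assumption): chat model compared with exact neutrality (log-odds 0).
chat_vs_neutral = odds_ratio(CHAT_LOG_ODDS)

# Reading 2 (assumption): chat model compared with the base point estimate.
chat_vs_base = odds_ratio(CHAT_LOG_ODDS - BASE_LOG_ODDS)

# Odds ratio implied by the Mistral FR-EN shift.
mistral_ratio = odds_ratio(MISTRAL_FR_EN_SHIFT)

print(f"chat vs neutral: {chat_vs_neutral:.1f}x")  # about 18.4x
print(f"chat vs base:    {chat_vs_base:.1f}x")     # about 21.3x
print(f"Mistral FR-EN:   {mistral_ratio:.1f}x")    # about 6.8x
print(f"P(favoured | base): {probability(BASE_LOG_ODDS):.2f}")
print(f"P(favoured | chat): {probability(CHAT_LOG_ODDS):.2f}")
```

Reading the log-odds as the logit of a single forced-choice probability is itself an assumption; if the probe aggregates over scenarios differently, the probability lines do not apply, though the odds-ratio conversion still holds.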