Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

Multiple-choice questions (MCQs) are often used to assess the knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism on LLM factual recall has been studied, this paper investigates the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e., do they behave like theoretical polyglots, or do they answer value-laden MCQs differently depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, consists solely of human-translated survey questions aligned across 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers, and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting further study of the selective effect of preference fine-tuning.
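To make the controlled prompt variations concrete, the following is a minimal sketch of how one MCQ could be expanded over answer order, symbol type, and tail character, and how agreement among the resulting answers could be scored. It is an illustration under assumptions, not the paper's implementation: the names `SYMBOL_SETS`, `TAIL_CHARS`, `build_prompts`, and `consistency`, the prompt layout, the specific symbols and tail characters, and the majority-share score are all hypothetical.

```python
from collections import Counter
from itertools import permutations, product

# Hypothetical variation axes; the values actually used in the study
# are not stated in the abstract.
SYMBOL_SETS = {"letters": "ABCD", "digits": "1234"}
TAIL_CHARS = ["", " ", "\n"]


def build_prompts(question, options):
    """Yield (prompt, symbol_to_option) for every order/symbol/tail combination."""
    for order, (_, symbols), tail in product(
        permutations(options), SYMBOL_SETS.items(), TAIL_CHARS
    ):
        lines = [question]
        mapping = {}
        for symbol, option in zip(symbols, order):
            lines.append(f"{symbol}. {option}")
            mapping[symbol] = option
        # The tail character is whatever follows the final "Answer:" cue.
        yield "\n".join(lines) + "\nAnswer:" + tail, mapping


def consistency(answers):
    """Share of answers equal to the most frequent answer (1.0 = full agreement)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


# Illustrative question and options (made up for this sketch).
prompts = list(
    build_prompts(
        "How important is family in your life?",
        ["Very important", "Not important", "Don't know"],
    )
)
print(len(prompts))  # 6 orders x 2 symbol sets x 3 tails

# Answers are compared after mapping symbols back to option texts, so that
# reordering the options does not count as a change of opinion.
print(consistency(["Very important", "Very important", "Not important", "Very important"]))
```

The same `consistency` score can be computed over the answers a model gives to one question in each language, which is one simple way to contrast polyglot-like behavior (scores near 1.0) with language-dependent answers.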
