Large Language Models (LLMs) encode vast world knowledge across multiple languages, yet their internal beliefs are often unevenly distributed across linguistic spaces. When external evidence contradicts these language-dependent memories, models encounter \emph{cross-lingual knowledge conflict}, a phenomenon largely unexplored beyond English-centric settings. We introduce \textbf{CLEAR}, a \textbf{C}ross-\textbf{L}ingual knowl\textbf{E}dge conflict ev\textbf{A}luation f\textbf{R}amework that systematically examines how multilingual LLMs reconcile conflicting internal beliefs with multilingual external evidence. CLEAR decomposes conflict resolution into four progressive scenarios, ranging from multilingual parametric elicitation to competitive multi-source cross-lingual induction, and evaluates model behavior across two complementary QA benchmarks with distinct task characteristics. We construct multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages and evaluate six representative LLMs. Our experiments reveal a task-dependent decision dichotomy. In reasoning-intensive tasks, conflict resolution is dominated by language resource abundance, with high-resource languages exerting stronger persuasive power. In contrast, for entity-centric factual conflicts, linguistic affinity, rather than resource scale, becomes decisive, allowing low-resource but linguistically aligned languages to outperform linguistically distant high-resource ones.