While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.
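The swapped-reference construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `JudgeItem` and `build_swapped_reference_items` are hypothetical, the example question and entities are invented for illustration, and the sketch assumes that an "aligned" candidate simply restates one of the two answers and that, under a reference-based protocol, a candidate should be graded correct exactly when it agrees with the provided reference.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JudgeItem:
    """One grading instance shown to a judge model (hypothetical structure)."""
    question: str
    reference: str
    candidate: str
    reference_kind: str      # "original" or "swapped"
    candidate_kind: str      # which answer the candidate is aligned with
    expected_correct: bool   # label a reference-adherent judge should output


def build_swapped_reference_items(question, original_answer, swapped_answer):
    """Build all pairings of {original, swapped} reference and candidate.

    Assumption: the expected label depends only on whether the candidate
    agrees with the provided reference, not on real-world correctness.
    """
    answers = {"original": original_answer, "swapped": swapped_answer}
    items = []
    for ref_kind, cand_kind in product(answers, repeat=2):
        items.append(
            JudgeItem(
                question=question,
                reference=answers[ref_kind],
                candidate=answers[cand_kind],
                reference_kind=ref_kind,
                candidate_kind=cand_kind,
                expected_correct=(ref_kind == cand_kind),
            )
        )
    return items


if __name__ == "__main__":
    # Illustrative example only; "Berlin" plays the role of the incorrect entity.
    for item in build_swapped_reference_items(
        "What is the capital of France?", "Paris", "Berlin"
    ):
        print(item.reference_kind, item.candidate_kind, item.expected_correct)
```

The two items with a swapped reference are the ones that induce a reference-belief conflict: a judge that falls back on parametric knowledge would tend to accept the original-aligned candidate and reject the swapped-aligned one, the opposite of the expected labels.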