Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks, MMLU-Pro and GPQA, we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and on 16-38% even among top-performing frontier models. These discrepancies suggest that different LLMs have distinct error profiles. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80% and, in some cases, reverse their sign. Together, these findings illustrate a benchmark illusion: equal accuracy can conceal systematic disagreement, making model choice a hidden yet consequential variable for scientific reproducibility.
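The core arithmetic behind the benchmark illusion can be made concrete with a minimal sketch (hypothetical data, not from the paper): two models with identical aggregate accuracy can still diverge on a large fraction of individual items whenever their errors fall on different questions.

```python
# Hypothetical toy example: two models, same accuracy, large item-level disagreement.
gold    = ["A", "B", "C", "D", "A", "B", "C", "D"]
model_1 = ["A", "B", "C", "D", "A", "B", "A", "A"]  # wrong on items 7 and 8
model_2 = ["B", "C", "C", "D", "A", "B", "C", "D"]  # wrong on items 1 and 2

def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement(a, b):
    """Fraction of items on which the two models give different answers."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(accuracy(model_1, gold))       # 0.75
print(accuracy(model_2, gold))       # 0.75
print(disagreement(model_1, model_2))  # 0.5 — identical accuracy, 50% disagreement
```

Because each model misses a disjoint set of items, the two are tied at 75% accuracy yet disagree on half the benchmark; any downstream annotation task inherits whichever error profile the chosen model happens to have.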