The rapid evolution of large language models (LLMs) and of the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on these popular but aging benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effects on LLM factuality evaluation, remain underexplored. In this work, we therefore present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released in different years. We tailor an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis show that a considerable portion of the samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work provides a testbed for assessing the reliability of benchmarks for LLM factuality evaluation and inspires more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.