Expert-designed close-ended benchmarks are indispensable for assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted over their reliability due to limited test scenarios and the unavoidable risk of data contamination. To address this, we present PertEval, a toolkit for in-depth probing of LLMs' knowledge capacity through \textbf{knowledge-invariant perturbations}. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, carefully retaining knowledge-critical content while altering irrelevant details. The toolkit further includes a suite of \textbf{response consistency analyses} that compare performance on raw versus perturbed test sets to precisely assess LLMs' genuine knowledge capacity. We re-evaluate six representative LLMs with PertEval. The results reveal significantly inflated performance on raw benchmarks, including an absolute 25.8% overestimation for GPT-4. Moreover, through a nuanced response pattern analysis, we find that PertEval preserves LLMs' uncertainty about specious knowledge and exposes their potential rote memorization of correct options, which leads to overestimated performance. We also find that PertEval's detailed response consistency analyses can illuminate various weaknesses in existing LLMs' knowledge mastery and guide their refinement. Our findings provide insights for developing more robust and genuinely knowledgeable LLMs. Our code is available at \url{https://github.com/aigc-apps/PertEval}.
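To make the comparison concrete, the following is a minimal illustrative sketch (not the PertEval implementation) of how raw-versus-perturbed evaluation can surface overestimated performance: accuracy is computed on a static benchmark and on its knowledge-invariant perturbation, and the gap between the two estimates the inflation; a per-item consistency rate shows how often the model's choice survives the perturbation. All answer lists below are hypothetical.

```python
# Illustrative sketch only: the answer lists are hypothetical, and the real
# PertEval toolkit performs far richer response consistency analyses.

def accuracy(answers, gold):
    """Fraction of items answered correctly."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def consistency(raw, perturbed):
    """Fraction of items where the model picks the same option before and
    after a knowledge-invariant perturbation, regardless of correctness."""
    return sum(r == p for r, p in zip(raw, perturbed)) / len(raw)

gold     = ["A", "B", "C", "D", "A", "B", "C", "D"]
raw_ans  = ["A", "B", "C", "D", "A", "B", "C", "A"]  # answers on the raw items
pert_ans = ["A", "B", "D", "B", "A", "C", "C", "A"]  # answers on perturbed items

raw_acc  = accuracy(raw_ans, gold)   # 0.875
pert_acc = accuracy(pert_ans, gold)  # 0.5
overestimation = raw_acc - pert_acc  # 0.375: inflation attributable to the raw benchmark
cons = consistency(raw_ans, pert_ans)  # 0.625: choices stable under perturbation
```

A large accuracy drop combined with low consistency suggests the raw-benchmark score reflects rote memorization of surface patterns rather than genuine knowledge mastery.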