In the burgeoning field of large language models (LLMs), assessing fundamental knowledge remains a critical challenge, particularly for models tailored to the Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench comprises 3,354 multiple-choice questions spanning commonsense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora and reveal a significant disparity between models' reasoning and memory-recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.
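The CircularEval protocol mentioned above can be illustrated with a minimal sketch. The idea, as commonly described, is that a model's answer to a multiple-choice question counts as correct only if the model still selects the right answer under every circular rotation of the options, which counteracts positional bias. The helper names and the `ask_model` callable below are hypothetical illustrations, not the paper's actual implementation:

```python
def circular_shifts(options):
    """Yield every circular rotation of the list of answer options."""
    for k in range(len(options)):
        yield options[k:] + options[:k]

def circular_eval(question, options, correct, ask_model):
    """CircularEval-style scoring (sketch): the question is credited only
    if the model picks the correct option under *all* rotations.

    `ask_model(question, options)` is a hypothetical callable that queries
    an LLM and returns the text of the option it chose.
    """
    for shifted in circular_shifts(options):
        if ask_model(question, shifted) != correct:
            return False  # one wrong rotation disqualifies the question
    return True
```

A question with four options is thus queried four times, so a model that merely favors a fixed answer position (e.g., always picking "A") is credited far less often than under single-pass scoring.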