New NLP benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess the advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from the humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that require advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 achieves an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate that C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.