C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, Junxian He

Source: arXiv; NeurIPS 2023. Website: https://cevalbenchmark.com

New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.

