As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper addresses that challenge for Chinese by introducing CMMLU, a comprehensive Chinese benchmark that covers a range of subjects, including the natural sciences, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to reach an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, against a random baseline of 25%. This leaves significant room for improvement. Additionally, we conduct extensive experiments to identify factors that affect model performance and propose directions for enhancing LLMs. CMMLU fills a gap in evaluating the knowledge and reasoning capabilities of LLMs in the Chinese context.