Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research, and numerous benchmarks have been proposed to evaluate their scientific research ability. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from the data-leakage problem and lacks an evaluation of subjective question-answering ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark that addresses these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to guard the evaluation against potential data leakage. SciEval includes both objective and subjective questions. These characteristics make SciEval a more effective benchmark for evaluating the scientific research ability of LLMs. Comprehensive experiments on the most advanced LLMs show that, although GPT-4 achieves state-of-the-art performance compared with other LLMs, there is still substantial room for improvement, especially on dynamic questions. The code and data are publicly available at https://github.com/OpenDFM/SciEval.