Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research, and numerous benchmarks have been proposed to evaluate their ability in this area. However, current benchmarks are mostly based on pre-collected objective questions. This design is prone to data leakage and does not evaluate subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark that addresses these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to guard the evaluation against potential data leakage. SciEval includes both objective and subjective questions. These characteristics make SciEval a more effective benchmark for evaluating the scientific research ability of LLMs. Comprehensive experiments on the most advanced LLMs show that, although GPT-4 achieves state-of-the-art performance compared with other LLMs, there is still substantial room for improvement, especially on dynamic questions. The data and code are now publicly available.