SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs across different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams; (2) multi-modal, containing questions that involve images; and (3) varied, containing different types of free-form questions at different difficulty levels, owing to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves only a 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not solve the exams perfectly, LLMs are decent graders, achieving a Pearson correlation of 0.948 with expert grading.
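
To make the grader-agreement figure concrete, the following is a minimal sketch of how a Pearson correlation between expert grades and LLM-assigned grades could be computed. The function name `pearson`, the grade lists, and the choice of normalized scores in [0, 1] are illustrative assumptions, not taken from the paper; the granularity at which the authors compute the correlation (per question or per exam) is not stated in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical normalized grades, made up for illustration only.
expert_grades = [0.9, 0.4, 0.75, 0.2, 0.6]
llm_grades = [0.85, 0.5, 0.7, 0.25, 0.65]

r = pearson(expert_grades, llm_grades)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the LLM grader ranks and spaces the answers much as the experts do; it does not by itself show that the absolute grades match, since Pearson correlation is invariant to a linear rescaling of either grader's scores.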
