With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that evaluate the abilities of LLMs across different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) composed of various types of free-form questions at different difficulty levels, owing to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, evaluating LLM performance is not straightforward; we therefore provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs, with the best LLM achieving only a 59.4\% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not solve the exams perfectly, LLMs are decent graders, achieving a 0.948 Pearson correlation with expert grading.
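The agreement between LLM-as-a-judge scores and expert grades is reported as a Pearson correlation. As a minimal illustration of how such agreement is computed, the sketch below correlates two hypothetical grade lists; the grade values are invented for the example and are not from SciEx.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-exam grades (fraction of maximum points):
# one list from a human expert, one from an LLM judge.
expert = [0.80, 0.55, 0.62, 0.90, 0.40]
judge = [0.78, 0.50, 0.65, 0.88, 0.45]
print(pearson(expert, judge))  # close to 1.0 when the judge tracks the expert
```

A correlation near 1 indicates that the judge ranks and scores answers in close agreement with the human grader, even if the absolute grades differ slightly.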