Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses despite the limited availability of human graders. Over the years, carefully trained models have achieved increasingly high levels of performance. More recently, pre-trained Large Language Models (LLMs) have emerged as a commodity, which raises an intriguing question: how does a general-purpose tool without additional training compare to specialized models? We studied the performance of GPT-4 on the standard 2-way and 3-way benchmark datasets SciEntsBank and Beetle. In addition to the standard task of grading how well a student answer aligns with a reference answer, we also investigated withholding the reference answer. We found that, overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to that of hand-engineered models, but worse than that of pre-trained LLMs that had specialized training.