Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical concerns, mainly stemming from potential biases and the risk of generating false information. In this manuscript, we therefore evaluate a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation targets automated short answer grading (ASAG), spanning multiple languages and examinations from two distinct courses. Our findings suggest that while "out-of-the-box" LLMs offer a valuable complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.