This study explores the feasibility of using large language models (LLMs), specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in an undergraduate Mechanical Engineering course. We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University, each answered by approximately 225 students. Both the LLM and the TAs followed the same instructor-provided rubric to ensure grading consistency. We evaluated performance with Spearman's rank correlation coefficient and Root Mean Square Error (RMSE), which measure, respectively, how closely GPT-4o's score rankings align with the TAs' and how accurately its scores match theirs, under zero- and few-shot grading settings. In the zero-shot setting, GPT-4o correlated strongly with TA grading: Spearman's rank correlation coefficient exceeded 0.6 on seven of the ten datasets, reaching a high of 0.9387. Our analysis shows that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers, particularly those using synonyms not present in the rubric. The model also tends to grade ambiguous cases more stringently than human TAs do. Overall, ChatGPT shows promise as a tool for grading conceptual questions, offering scalability and consistency.
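The two evaluation metrics can be sketched concretely. Below is a minimal, self-contained Python illustration of how LLM-assigned scores might be compared against TA scores using Spearman's rank correlation (Pearson correlation on tie-averaged ranks) and RMSE; the score lists are made-up illustrative data, not values from the study.

```python
import math

def ranks(xs):
    """Return 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

def rmse(a, b):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical scores on a 5-point question for eight students.
ta_scores = [5, 4, 3, 5, 2, 4, 1, 3]
llm_scores = [5, 3, 3, 4, 2, 4, 2, 3]

print(f"Spearman rho: {spearman(ta_scores, llm_scores):.4f}")
print(f"RMSE:         {rmse(ta_scores, llm_scores):.4f}")
```

A high rho with a nonzero RMSE would indicate that the model orders students much like the TAs do while systematically assigning different point values, which is why the study reports both metrics.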