Assessing student answers and providing useful feedback is crucial for effective learning, but it is time-consuming. Traditional approaches that automate student answer assessment through text classification often lack trustworthiness and transparency, and cannot provide a rationale for the assessments they produce. These limitations hinder their usefulness in practice. In this paper, we explore using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation under both zero-shot and few-shot settings. We introduce a critic module that automatically filters incorrect outputs from ChatGPT and uses the remaining ChatGPT outputs as noisy labelled data to fine-tune a smaller language model, enabling it to perform student answer scoring and rationale generation. Moreover, by drawing multiple samples from ChatGPT, we can compute predictive confidence scores, which in turn can be used to identify corrupted data and human label errors in the training set. Our experimental results demonstrate that, despite being a few orders of magnitude smaller than ChatGPT, the fine-tuned language model achieves better performance in student answer scoring. Furthermore, it generates more detailed and comprehensible assessments than traditional text classification methods. Our approach provides a viable solution for explainable automated assessment in education.
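The abstract does not say how the predictive confidence scores are computed from the multiple samples. As a minimal sketch, assuming confidence is simply the fraction of sampled scores that agree with the majority score, the idea of flagging possible human label errors could look like the following. The function names, the `threshold` value, and the data layout are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter


def predictive_confidence(sampled_scores):
    """Return (majority_score, confidence) for one student answer.

    Assumption: confidence is the share of samples that agree with the
    most frequent score. The paper may define it differently.
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(sampled_scores)
    majority_score, votes = counts.most_common(1)[0]
    return majority_score, votes / len(sampled_scores)


def flag_suspect_labels(samples_by_answer, human_labels, threshold=0.8):
    """List answer ids where a confident majority disagrees with the human label.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    suspects = []
    for answer_id, sampled_scores in samples_by_answer.items():
        score, confidence = predictive_confidence(sampled_scores)
        if confidence >= threshold and score != human_labels[answer_id]:
            suspects.append(answer_id)
    return suspects
```

Under this assumption, an answer whose samples mostly agree with each other but contradict the human score would be queued for manual review, while answers with scattered samples would be treated as low-confidence and left alone.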