Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify appropriate instructions by prompting ChatGPT with different templates to collect rationales, and we refine inconsistent rationales to align with the marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall quadratic weighted kappa (QWK) score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution for achieving explainable automated assessment in education. Code is available at https://github.com/lijiazheng99/aera.
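For readers unfamiliar with the reported metric, the following is a minimal sketch of how QWK (quadratic weighted kappa) between human-assigned and model-assigned scores is commonly computed. It follows the standard textbook definition and is not taken from the released code; the function name `quadratic_weighted_kappa`, its parameters, and the choice to return 1.0 in the degenerate single-score case are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    n_labels = max_score - min_score + 1
    if n_labels < 2:
        raise ValueError("the score range must contain at least two labels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger score gaps cost disproportionately more.
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            # Count expected by chance if the two raters were independent.
            expected = human_hist[i] * model_hist[j] / n
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Assumed convention: both raters gave the same single score throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement with the human scores, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.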