Fine-tuning ChatGPT for Automatic Scoring

This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student written constructed responses using example assessment tasks in science education. Recent studies on OpenAI's generative model GPT-3.5 proved its superiority in predicting the natural language with high accuracy and human-like responses. GPT-3.5 has been trained over enormous online language materials such as journals and Wikipedia; therefore, more than direct usage of pre-trained GPT-3.5 is required for automatic scoring as students utilize a different language than trained material. These imply that a domain-specific model, fine-tuned over data for specific tasks, can enhance model performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scoring. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with the fine-tuned state-of-the-art Google's generated language model, BERT. The results show that in-domain training corpora constructed from science questions and responses for BERT achieved average accuracy = 0.838, SD = 0.069. GPT-3.5 shows a remarkable average increase (9.1%) in automatic scoring accuracy (mean = 9.15, SD = 0.042) for the six tasks, p =0.001 < 0.05. Specifically, for multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all the labels, with the second item achieving a 7.1% increase. The average scoring increase for the four multi-class items for GPT-3.5 was 10.6% compared to BERT. Our study confirmed the effectiveness of fine-tuned GPT-3.5 for automatic scoring of student responses on domain-specific data in education with high accuracy. We have released fine-tuned models for public use and community engagement.

翻译：本研究揭示了微调版ChatGPT（GPT-3.5）在科学教育范例评估任务中，针对学生书面建构性回答进行自动评分的潜力。关于OpenAI生成式模型GPT-3.5的最新研究已证明其在自然语言预测方面具备高准确率与类人应答的卓越能力。由于GPT-3.5基于海量在线语言材料（如期刊和维基百科）训练，而学生用语与训练材料存在差异，因此直接使用预训练GPT-3.5进行自动评分尚不充分。这表明，针对特定任务数据微调的领域专用模型能够提升模型性能。本研究基于六项评估任务对GPT-3.5进行微调，所用数据集包含初中生与高中生的多样化回答及专家评分。六项任务包括两项多标签分类任务和四项多类别分类任务。我们将微调版GPT-3.5与微调版谷歌前沿语言模型BERT进行性能比较。结果表明，基于科学问题与回答构建的领域内训练语料库中，BERT的平均准确率为0.838（标准差=0.069）。GPT-3.5在六项任务中的自动评分准确率显著提升（平均提升9.1%，均值=9.15，标准差=0.042，p=0.001<0.05）。具体而言，在多标签任务（任务1含5个标签，任务2含10个标签）中，GPT-3.5在所有标签上的评分准确率均显著高于BERT，其中第二项任务准确率提升7.1%。GPT-3.5在四项多类别任务中的平均评分提升较BERT达10.6%。本研究证实，针对教育领域领域特定数据微调的GPT-3.5能够以高准确率实现学生回答的自动评分。我们已公开发布微调模型，供公众使用与社区参与。