Fine-tuning ChatGPT for Automatic Scoring

This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student written constructed responses using example assessment tasks in science education. Recent studies on OpenAI's generative model GPT-3.5 proved its superiority in predicting the natural language with high accuracy and human-like responses. GPT-3.5 has been trained over enormous online language materials such as journals and Wikipedia; therefore, more than direct usage of pre-trained GPT-3.5 is required for automatic scoring as students utilize a different language than trained material. These imply that a domain-specific model, fine-tuned over data for specific tasks, can enhance model performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scoring. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with the fine-tuned state-of-the-art Google's generated language model, BERT. The results show that in-domain training corpora constructed from science questions and responses for BERT achieved average accuracy = 0.838, SD = 0.069. GPT-3.5 shows a remarkable average increase (9.1%) in automatic scoring accuracy (mean = 9.15, SD = 0.042) for the six tasks, p =0.001 < 0.05. Specifically, for multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all the labels, with the second item achieving a 7.1% increase. The average scoring increase for the four multi-class items for GPT-3.5 was 10.6% compared to BERT. Our study confirmed the effectiveness of fine-tuned GPT-3.5 for automatic scoring of student responses on domain-specific data in education with high accuracy. We have released fine-tuned models for public use and community engagement.

翻译：本研究聚焦于科学教育中通过示例评估任务，探讨微调版ChatGPT（GPT-3.5）对学生书面构建反应进行自动评分的潜力。近期关于OpenAI生成式模型GPT-3.5的研究证明了其在高精度预测自然语言及生成类人反应方面的优越性。由于GPT-3.5基于期刊、维基百科等海量在线语言材料训练，而学生使用的语言与训练材料存在差异，因此直接使用预训练GPT-3.5进行自动评分尚不足够。这表明，针对特定任务数据微调的领域专用模型可提升模型性能。本研究基于包含中学生和高中生多样化反应数据及专家评分的六项评估任务，对GPT-3.5进行微调。六项任务包括两项多标签任务和四项多分类任务。我们将微调GPT-3.5与经过微调的谷歌最先进生成语言模型BERT进行性能对比。结果表明，基于科学问题与反应构建的领域内训练语料，BERT的平均准确率为0.838（标准差=0.069）。GPT-3.5在六项任务中的自动评分准确率平均提升9.1%（均值=9.15，标准差=0.042），p=0.001<0.05，表现显著。具体而言，对于多标签任务（任务1含5个标签；任务2含10个标签），GPT-3.5在所有标签上的评分准确率均显著高于BERT，其中任务2准确率提升7.1%。与BERT相比，GPT-3.5在四项多分类任务中的平均评分提升达10.6%。本研究证实，微调GPT-3.5能够以高准确率实现教育领域特定数据中学生反应的自动评分。我们已向公众开放微调模型以供使用和社区参与。