Using GPT-4 to Augment Unbalanced Data for Automatic Scoring

Machine learning-based automatic scoring is challenging when students' responses are unbalanced across scoring categories, because the imbalance introduces uncertainty into model training. To meet this challenge, we introduce a novel text data augmentation framework using GPT-4, a generative large language model, specifically tailored for unbalanced datasets in automatic scoring. Our experimental dataset comprised student-written responses to two science items. We crafted prompts for GPT-4 to generate responses resembling student-written answers, particularly for the minority scoring classes, to augment the data. We then fine-tuned DistilBERT for automatic scoring on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and F1 score. We incorporated varied amounts of augmented data to examine scoring performance, and our findings revealed remarkably improved model performance. The average maximum increase observed across the two items was 3.5% for accuracy, 30.6% for precision, 21.1% for recall, and 24.2% for F1 score. Notably, using just 5% of the augmented data led to substantial improvements: 2.6% for accuracy, 29.2% for precision, 15.1% for recall, and 19.6% for F1 score. Interestingly, the extent of improvement varied across datasets. Moreover, we found that the amount of augmented data needed to obtain a stable improvement varied from 5% to 40%. We also compared models trained with GPT-4 augmented data with those trained with additional student-written responses. The findings indicate that the former match or even exceed the performance of the latter. Specifically, the average differences were 1.7%, 1.9%, 11.0%, and 7.8% for the four metrics, respectively. This research underscores the potential and effectiveness of data augmentation techniques utilizing GPT-4 in addressing unbalanced datasets within automated assessment.
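
To make the evaluation setup concrete, the following is a minimal sketch of two steps the abstract describes: mixing a chosen share of augmented responses into the training set, and scoring predictions with accuracy, precision, recall, and F1. It rests on assumptions not stated in the abstract: the helper names `add_augmented` and `macro_scores` are hypothetical, the share is taken as a fraction of the augmented pool, the per-class metrics are macro-averaged, and the toy responses are placeholders rather than data from the study.

```python
import random
from collections import Counter


def add_augmented(original, augmented, fraction, seed=0):
    """Return the original examples plus a random `fraction` of the augmented pool.

    Assumption: the share is measured against the size of the augmented pool.
    """
    rng = random.Random(seed)
    k = round(len(augmented) * fraction)
    return original + rng.sample(augmented, k)


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy placeholder data: (response text, score label); label 1 is the minority class.
original = [("majority response", 0)] * 18 + [("minority response", 1)] * 2
augmented = [("generated minority response", 1)] * 20

train = add_augmented(original, augmented, fraction=0.05)
class_counts = Counter(label for _, label in train)

# A scorer that always predicts the majority class on a 3:1 test set.
scores = macro_scores([0, 0, 0, 1], [0, 0, 0, 0])
```

In this toy case the always-majority scorer reaches 0.75 accuracy but only 0.5 macro recall, which illustrates one plausible reason why, on unbalanced data, precision, recall, and F1 can move much more than accuracy when minority-class examples are added.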