Developing models to automatically score students' written responses to science problems is critical for science education. However, collecting and labeling enough student responses to train such models is time-consuming and costly. Recent studies suggest that pre-trained language models (PLMs) can be adapted to downstream tasks through prompts, without fine-tuning. However, no research has applied such a prompt-based approach in science education. Because student responses are written in natural language, framing the scoring procedure as a next sentence prediction task through prompts can skip the costly fine-tuning stage. In this study, we developed a zero-shot approach that automatically scores student responses via Matching Exemplars as Next Sentence Prediction (MeNSP). This approach uses no training samples. We first applied MeNSP to score three assessment tasks on scientific argumentation and found machine-human scoring agreement with Cohen's Kappa ranging from 0.30 to 0.57 and F1 scores ranging from 0.54 to 0.81. To improve performance, we extended our research to the few-shot setting, fine-tuning the models on either randomly selected labeled student responses or manually constructed responses. We found that performance on one task improved with more samples, with Cohen's Kappa rising from 0.30 to 0.38 and the F1 score from 0.54 to 0.59; for the other two tasks, scoring performance did not improve. We also found that randomly selected few-shot samples performed better than the samples crafted by human experts. This study suggests that MeNSP can yield automatic scores for student responses that are usable as a reference while significantly reducing the cost of model training. This method can benefit low-stakes classroom assessment practices in science education. Future research should further explore the applicability of MeNSP to different types of assessment tasks in science education and improve model performance.
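To make the idea of matching exemplars as next sentence prediction concrete, the following is a minimal sketch, not the implementation used in this study. It assumes a scorer `nsp_prob(first, second)` that returns how likely `second` is to follow `first`; in practice this could be backed by the next sentence prediction head of a PLM such as BERT. The function names, the choice to average over the exemplars of each score level, the ordering of response and exemplar in the pair, and the word-overlap stand-in scorer are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Assumed interface: probability in [0, 1] that `second` follows `first`.
NspScorer = Callable[[str, str], float]


def score_response(
    response: str,
    exemplars: Dict[str, List[str]],
    nsp_prob: NspScorer,
) -> Tuple[str, Dict[str, float]]:
    """Assign the score level whose exemplars best 'follow' the response.

    Averaging over exemplars is an assumption made for this sketch.
    """
    if not exemplars:
        raise ValueError("at least one score level with exemplars is required")
    label_scores: Dict[str, float] = {}
    for label, texts in exemplars.items():
        if not texts:
            raise ValueError(f"no exemplars given for score level {label!r}")
        probs = [nsp_prob(response, text) for text in texts]
        label_scores[label] = sum(probs) / len(probs)
    best_label = max(label_scores, key=label_scores.get)
    return best_label, label_scores


def toy_overlap_scorer(first: str, second: str) -> float:
    """Stand-in for a PLM: Jaccard word overlap, NOT a real NSP model."""
    a, b = set(first.lower().split()), set(second.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical exemplars and response, invented for illustration only.
exemplars = {
    "proficient": ["the gas particles move faster when heated so pressure increases"],
    "beginning": ["i do not know why it happens"],
}
response = "when heated the particles move faster and the pressure increases"

label, scores = score_response(response, exemplars, toy_overlap_scorer)
print(label, scores)
```

Passing the scorer in as a callable keeps the matching logic independent of any particular model, so the toy scorer can be swapped for a PLM-backed one without changing `score_response`.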