Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential for scaling to large numbers of responses. Recent approaches to automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, because scoring is a subjective process, these human scores are noisy and can vary widely from scorer to scorer. In this paper, we investigate a collection of models that account for the individual preferences and tendencies of each human scorer in the automated scoring task. We apply these models to a short-answer math response dataset in which each response is scored, often differently, by multiple human scorers. Quantitative experiments show that our scorer models improve automated scoring accuracy. We also use quantitative experiments and case studies to analyze the individual preferences and tendencies of scorers. We find that scorers can be grouped into several clear clusters, each with distinct features, and we analyze these clusters in detail.