Modeling and Analyzing Scorer Preferences in Short-Answer Math Questions

Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can vary widely from scorer to scorer. In this paper, we investigate a collection of models that account for the individual preferences and tendencies of each human scorer in the automated scoring task. We apply these models to a short-answer math response dataset in which each response is scored, often differently, by multiple human scorers. We conduct quantitative experiments to show that our scorer models lead to improved automated scoring accuracy. We also conduct quantitative experiments and case studies to analyze the individual preferences and tendencies of scorers. We find that scorers can be grouped into several clearly separated clusters, each with distinct features, and we analyze these clusters in detail.

