Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often use textual representations from pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which suits scenarios such as essay scoring where items can differ substantially from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage in scenarios such as reading comprehension, where multiple items may share a reading passage; 2) they do not scale, since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Educational Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items, with a carefully designed input structure that provides contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
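To make the idea of a single shared model with item-specific input more concrete, the following is a minimal sketch of how such an input might be assembled. The abstract does not specify the actual input structure, so everything here is an assumption for illustration: the `Item` class, the `build_input` function, the field labels, and the use of a literal `[SEP]` string as a separator are all hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# BERT-style separator, used here only as a plain string (assumption).
SEP = " [SEP] "


@dataclass
class Item:
    """Hypothetical container for the per-item context."""
    item_id: str
    question: str
    # (response text, score) pairs, e.g. sampled from training data for this item
    examples: List[Tuple[str, int]]


def build_input(item: Item, response: str, max_examples: int = 2) -> str:
    """Concatenate the response to be scored with context about its item.

    Because the item context travels inside the input text, one shared
    model could in principle score responses to any item.
    """
    parts = [f"score this answer: {response}", f"question: {item.question}"]
    # Append up to max_examples scored examples from the same item.
    for text, score in item.examples[:max_examples]:
        parts.append(f"example: {text} score: {score}")
    return SEP.join(parts)
```

In a sketch like this, the resulting string would be tokenized and passed to one fine-tuned classifier, so adding a new item would mean supplying new context rather than storing a new model.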