In this paper, we introduce "Marking", a novel grading task that enhances automated grading systems by analyzing student responses in depth and providing students with visual highlights. Unlike traditional systems that provide only binary scores, "Marking" identifies and categorizes segments of the student response as correct, incorrect, or irrelevant, and detects content from the gold answer that the response omits. We introduce a new dataset curated by Subject Matter Experts specifically for this task. We frame "Marking" as an extension of the Natural Language Inference (NLI) task, which has been studied extensively in Natural Language Processing. The gold answer and the student response play the roles of premise and hypothesis in NLI, respectively. We then train language models to identify entailment, contradiction, and neutrality in the student response, as in NLI, with the added dimension of identifying omissions from the gold answer. Our experimental setup uses transformer models, specifically BERT and RoBERTa, together with a training step that uses the e-SNLI dataset. We present extensive baseline results that highlight the complexity of the "Marking" task and set a clear direction for future work. Our work opens new avenues for research in AI-powered educational assessment tools and provides a benchmark for the AI in education community to engage with and improve upon. The code and dataset can be found at https://github.com/luffycodes/marking.
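To make the NLI framing concrete, the following is a minimal sketch of how a single "Marking" example might be represented. The class names, the helper functions, the label mapping (entailment to correct, contradiction to incorrect, neutral to irrelevant), and the toy sentences are all illustrative assumptions; they are not taken from the released code or dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed mapping from NLI labels to marking categories (illustrative only).
NLI_TO_MARK = {
    "entailment": "correct",
    "contradiction": "incorrect",
    "neutral": "irrelevant",
}


@dataclass
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # an NLI label, or "omission" for gold-answer spans


@dataclass
class MarkingExample:
    gold_answer: str            # plays the role of the NLI premise
    student_response: str       # plays the role of the NLI hypothesis
    response_spans: List[Span]  # offsets into student_response
    omitted_spans: List[Span]   # offsets into gold_answer


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of phrase in text and wrap it as a Span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


def to_nli_pair(example: MarkingExample) -> Tuple[str, str]:
    """Return (premise, hypothesis) as described in the abstract."""
    return example.gold_answer, example.student_response


def render_marks(example: MarkingExample) -> List[Tuple[str, str]]:
    """List (text, category) pairs: response spans first, then omissions."""
    marks = [
        (example.student_response[s.start:s.end], NLI_TO_MARK[s.label])
        for s in example.response_spans
    ]
    marks += [
        (example.gold_answer[s.start:s.end], "omitted")
        for s in example.omitted_spans
    ]
    return marks


# Hypothetical toy example, not drawn from the dataset.
gold = "Plants absorb carbon dioxide and release oxygen."
student = "Plants absorb carbon dioxide. They release nitrogen. They are green."

example = MarkingExample(
    gold_answer=gold,
    student_response=student,
    response_spans=[
        find_span(student, "Plants absorb carbon dioxide", "entailment"),
        find_span(student, "They release nitrogen", "contradiction"),
        find_span(student, "They are green", "neutral"),
    ],
    omitted_spans=[find_span(gold, "release oxygen", "omission")],
)

for text, category in render_marks(example):
    print(f"{category:>10}: {text}")
```

In this sketch the omission spans index into the gold answer rather than the student response, which is one plausible way to capture the "added dimension" beyond standard three-way NLI.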