We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinct from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than that of typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference (NLI) models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables the effective use of the MNLI (Multi-Genre Natural Language Inference) corpus for transfer learning, significantly improving model performance on the RiceChem dataset. We demonstrate the importance of the rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate model performance in cold-start scenarios, providing valuable insights into practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-source Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: \url{https://github.com/luffycodes/Automated-Long-Answer-Grading}.
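The rubric-entailment formulation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `entails` stub stands in for an NLI model fine-tuned on MNLI, and all function names, the example rubric, and the example answer are hypothetical.

```python
def entails(premise: str, hypothesis: str) -> bool:
    # Stub for an NLI model (e.g., one fine-tuned on MNLI) that
    # decides whether the premise entails the hypothesis.
    # Here we use a naive substring check purely for illustration.
    return hypothesis.lower() in premise.lower()

def grade_response(student_answer: str, rubric_items: list[str]) -> float:
    # Each rubric item is verified independently as an entailment
    # hypothesis against the student's response (the premise).
    # The score is the fraction of rubric items addressed.
    satisfied = [item for item in rubric_items if entails(student_answer, item)]
    return len(satisfied) / len(rubric_items)

# Hypothetical rubric and student answer for demonstration only.
rubric = ["ionic bond", "electron transfer"]
answer = "An ionic bond forms when electron transfer occurs between atoms."
score = grade_response(answer, rubric)
```

In a real system, each (response, rubric item) pair would be passed to the NLI model, and per-item entailment decisions could either be aggregated into a score or surfaced individually as criterion-level feedback.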