Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability

Automated Essay Scoring (AES) is a well-established area of educational research that uses machine learning to evaluate student-authored essays. Although much effort has gone into this area, current research focuses primarily on either (i) improving the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often relies heavily on labeled data from the same target prompt; or (ii) assessing how well AES models developed on non-target prompts apply to the intended target prompt (i.e., developing AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if it does, how it interferes with an AES model's accuracy and generalizability. Our study therefore aimed to uncover the relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we selected nine prominent AES methods and evaluated their performance using seven metrics on an open-source dataset that contains over 25,000 essays together with demographic information about students, such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit greater bias towards students of different economic statuses than cross-prompt models do; (3) in the pursuit of generalizability, traditional machine learning models coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models.
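As a rough illustration of the kind of evaluation described above, the following sketch computes one agreement measure and one group-level error measure for a set of predicted scores. The abstract does not name its seven metrics, so both choices here are assumptions: quadratic weighted kappa (QWK) is a measure commonly used for AES agreement, and the per-group mean signed error is only one simple way to look at score bias across demographic groups. The function names, the toy scores, and the group labels are all hypothetical.

```python
from collections import defaultdict


def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Agreement between integer human and model scores (assumed metric).

    Assumes max_score > min_score and that every score lies in that range.
    """
    n = max_score - min_score + 1
    # Confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used the same single score for every essay.
        return 1.0
    return 1.0 - numerator / denominator


def mean_signed_error_by_group(y_true, y_pred, groups):
    """Mean of (predicted - human) score per demographic group (illustrative)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        sums[g] += p - t
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


if __name__ == "__main__":
    # Toy data, not taken from the dataset described in the abstract.
    human = [2, 3, 4, 3, 1, 4]
    model = [2, 3, 3, 3, 2, 4]
    group = ["A", "A", "B", "B", "A", "B"]  # hypothetical group labels

    print("QWK:", quadratic_weighted_kappa(human, model, 1, 4))
    print("Mean signed error by group:",
          mean_signed_error_by_group(human, model, group))
```

Under these assumptions, a model would be compared across the prompt-specific and cross-prompt settings by computing the same two quantities on each setting's predictions; a large difference in mean signed error between groups would suggest that the model systematically over- or under-scores one of them.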
