Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance comparable or superior to trained human raters, but have repeatedly been shown to be vulnerable to construct-irrelevant factors (i.e., features of responses that are unrelated to the construct being assessed) and adversarial conditions. Given the growing use of large language models in automated scoring systems, there is renewed focus on ``hallucinations'' and on the robustness of LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. The scoring system was generally robust to padding responses with meaningless text, to spelling errors, and to variations in writing sophistication. Duplicating large passages of text lowered the system's predicted scores, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.