Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important when a large number of exams must be graded within a limited time frame, as with nationwide graduation exams in many countries. Here, we examine the applicability of automated scoring to two large datasets of trial exam essays covering two full national cohorts from Estonia. We operationalize the official curriculum-based rubric and compare LLM-based and statistical natural language processing (NLP) based assessments with human panel scores. The results show that automated scoring can achieve performance comparable to that of human raters and tends to fall within the human scoring range. We also evaluate bias, prompt injection risks, and LLMs as essay writers. These findings demonstrate that a principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, which is particularly relevant for digitally advanced societies like Estonia, which is about to adopt a fully electronic examination system. Furthermore, the system produces fine-grained subscore profiles that can be used to generate systematic, personalized feedback for instruction and exam preparation. The study provides evidence that LLM-assisted assessment can be implemented at a national scale, even in a small-language context, while maintaining human oversight and compliance with emerging educational and regulatory standards.