Assessing graduate-level research reading reports places a substantial labor burden on educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly with respect to grading consistency, the lack of which is a primary obstacle to educational fairness. This paper proposes a human-aligned, LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find that the LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, and that simple ensemble approaches fail to improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in the models' grading standards away from human expert scores. Our findings demonstrate the potential of LLMs to reduce educators' grading workload in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness and that specific operational practices are required to mitigate such disparities.