Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs

In this paper, we investigate the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which primarily evaluates proprietary LLMs, we focus on non-proprietary models, making our approach suitable for universities where transparency and cost are critical. Additionally, existing studies assess performance over complete diagrams rather than individual criteria, offering limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are independently evaluated by both teaching assistants (TAs) and LLMs, and the resulting grades are compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades against assessments produced by six popular open-source LLMs. Performance is measured for each individual criterion, highlighting the criteria on which LLMs diverge from human graders. Our results show per-criterion accuracy of up to 88.56% and a Pearson correlation coefficient of up to 0.78, representing a substantial improvement over previous work while using only open-source models. We also explore the concept of an optimal model that combines the best-performing LLM for each criterion. This optimal model achieves performance close to that of a TA, suggesting a possible path toward a mixed-initiative grading system. Our findings demonstrate that open-source LLMs can effectively support UML class diagram grading when their alignment with human graders is explicitly identified. The proposed pipeline provides a practical approach to managing the assessment workload that grows with student numbers.
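To make the per-criterion comparison concrete, the following is a minimal sketch of how such an evaluation could be computed. It is not the pipeline used in the study: the criterion names, model names, binary grades, and helper functions are all hypothetical, and the "optimal model" is read here as simply picking, for each criterion, the LLM with the highest agreement with the TA.

```python
# Illustrative sketch only: all names and grades below are made up.

def criterion_accuracy(ta, llm):
    """Fraction of diagrams where the LLM grade equals the TA grade."""
    return sum(1 for t, m in zip(ta, llm) if t == m) / len(ta)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_model_per_criterion(ta_grades, llm_grades):
    """Map each criterion to the model that agrees most often with the TA."""
    return {
        crit: max(llm_grades,
                  key=lambda m: criterion_accuracy(ta, llm_grades[m][crit]))
        for crit, ta in ta_grades.items()
    }

def total_scores(grades):
    """Sum the per-criterion grades into one total score per diagram."""
    return [sum(per_diagram) for per_diagram in zip(*grades.values())]

# Hypothetical data: grades for four diagrams on two criteria (1 = met).
ta_grades = {"classes": [1, 1, 0, 1], "associations": [1, 0, 0, 1]}
llm_grades = {
    "model_a": {"classes": [1, 1, 0, 0], "associations": [1, 0, 0, 1]},
    "model_b": {"classes": [1, 1, 0, 1], "associations": [0, 0, 1, 1]},
}

best = best_model_per_criterion(ta_grades, llm_grades)
# Combined grader: for each criterion, use the grades of its best model.
combined = {crit: llm_grades[model][crit] for crit, model in best.items()}

for crit, model in best.items():
    acc = criterion_accuracy(ta_grades[crit], combined[crit])
    print(f"{crit}: best={model}, accuracy={acc:.2f}")

r = pearson(total_scores(ta_grades), total_scores(llm_grades["model_a"]))
print(f"Pearson r (TA vs model_a totals): {r:.2f}")
```

Because this sketch selects the best model per criterion on the same data it is then scored on, the combined result is best understood as an upper bound; a practical system would choose the per-criterion models on held-out diagrams.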
