Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. Although effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LPs) provide a theoretically grounded representation of students' developing understanding and may offer an alternative to hand-authored rubrics. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback on written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a task-specific rubric designed by a human expert, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression before grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high: percent agreement ranged from 89% to 100%, and Cohen's kappa for the estimable dimensions ranged from .66 to .88. Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can produce feedback comparable in quality to feedback guided by expert-authored rubrics, and may therefore serve as a more scalable alternative.
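The reliability and comparison statistics named in the abstract (percent agreement, Cohen's kappa, and the paired t statistic) can be illustrated with a minimal sketch. The function names and the toy ratings below are hypothetical and are not the study's code or data. The sketch also shows one common reason a kappa value may not be estimable: when both coders use a single category throughout, expected agreement equals 1 and the denominator is zero.

```python
import math
from collections import Counter


def percent_agreement(r1, r2):
    """Share of items on which two coders gave the same rating."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    Returns None when p_e == 1 (both coders used one category only),
    since kappa is undefined in that case.
    """
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each coder's marginal category frequencies.
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1 - p_e)


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for matched scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matched pairs")
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    if var == 0:
        # All differences identical: t is 0 if they are all zero,
        # otherwise unbounded.
        t = 0.0 if mean == 0 else math.copysign(math.inf, mean)
        return t, n - 1
    return mean / math.sqrt(var / n), n - 1


# Hypothetical binary codes from two coders on ten feedback messages.
coder_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
coder_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
agreement = percent_agreement(coder_a, coder_b)
kappa = cohen_kappa(coder_a, coder_b)

# Hypothetical quality scores for the same responses under each pipeline.
expert_rubric = [3, 3, 2, 3]
lp_rubric = [2, 3, 2, 2]
t_stat, df = paired_t(expert_rubric, lp_rubric)
```

Converting the t statistic to a p-value requires the t distribution with `df` degrees of freedom, which a statistics library would normally supply; that step is left out here to keep the sketch dependency-free.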