Using Learning Progressions to Guide AI Feedback for Science Learning

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LPs) provide a theoretically grounded representation of students' developing understanding and may offer an alternative. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback on written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a task-specific rubric designed by a human expert, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression before grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high: percent agreement ranged from 89% to 100%, and Cohen's kappa for the estimable dimensions ranged from .66 to .88. Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative to expert-authored task rubrics.
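The reliability and comparison statistics named in the abstract (percent agreement, Cohen's kappa, and the paired t statistic) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names and all scores below are hypothetical and chosen only to show the arithmetic. The sketch also shows one common reason a kappa value is not estimable: when both coders assign the same single code to every item, chance agreement equals 1 and the kappa formula divides by zero.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def percent_agreement(a, b):
    """Share of items on which the two coders gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa, or None when chance agreement is 1 (not estimable)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the coders' marginal proportions per code.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for matched scores."""
    d = [xi - yi for xi, yi in zip(x, y)]
    sd = stdev(d)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(d) / (sd / sqrt(len(d))), len(d) - 1


# Hypothetical codes from two coders on ten feedback messages.
coder_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
coder_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(percent_agreement(coder_1, coder_2), cohens_kappa(coder_1, coder_2))

# Hypothetical quality scores for the same responses under each pipeline.
expert_rubric = [2, 3, 2, 3, 2, 3]
lp_rubric = [2, 2, 3, 3, 2, 3]
print(paired_t(expert_rubric, lp_rubric))
```

The sketch stops at the t statistic and its degrees of freedom. In practice a p-value would come from the t distribution, for example through `scipy.stats.ttest_rel`, and `sklearn.metrics.cohen_kappa_score` offers a library implementation of kappa.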

