Large language models (LLMs) are increasingly used to provide automated feedback in introductory programming courses, yet empirical evidence from authentic classroom deployments comparing different feedback modalities remains limited. In this work, we present a large-scale classroom study in which AI-generated feedback was deployed through a randomized protocol in an introductory Python programming course. Students received one of three feedback conditions on incorrect submissions: natural language hints, AI-generated failing test cases, or no AI feedback. We release the resulting dataset, ProgFeed, which captures 6,693 submissions from 215 consenting students across 17 labs, including feedback conditions, execution-based performance measures, and fine-grained temporal information. Using this data, we analyze learning trajectories, feedback quality, and submission behavior over repeated attempts. We find that natural language feedback is significantly associated with higher completion rates and faster convergence to correct solutions. Test case feedback, by contrast, exhibits heterogeneous effects that depend critically on feedback validity. Our results suggest that the form of AI-generated feedback matters, and that evaluating feedback quality, not just its presence, is essential for understanding its pedagogical impact.