Large Language Models (LLMs), such as ChatGPT, are quickly advancing AI to the frontiers of practical consumer use and leading industries to re-evaluate how they allocate resources for content production. Authoring open educational resources and hint content for adaptive tutoring systems is labor intensive. Should LLMs like ChatGPT produce educational content on par with human-authored content, the implications would be significant for further scaling of computer tutoring system approaches. In this paper, we conduct the first learning gain evaluation of ChatGPT, comparing the efficacy of its hints with that of hints authored by human tutors in a study with 77 participants across two algebra topic areas, Elementary Algebra and Intermediate Algebra. We find that 70% of hints produced by ChatGPT passed our manual quality checks and that both the human and ChatGPT conditions produced positive learning gains. However, gains were statistically significant only for hints created by human tutors. Learning gains from human-created hints were substantially and statistically significantly higher than those from ChatGPT hints in both topic areas, though ChatGPT participants in the Intermediate Algebra experiment were near ceiling and did not match the control at pre-test. We discuss the limitations of our study and suggest several future directions for the field. Problem and hint content used in the experiment is provided for replicability.