We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. The system integrates two LLM-supported tools: a structured proof-review tool that provides feedback embedded in students' written proof attempts, and a chatbot for mathematics questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, but we did not observe this gain transfer to exam scores. Usage logs show that students with lower self-efficacy and lower prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled with an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking versus help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, these findings suggest that chatbot-based support alone may not reliably transfer to proof-learning outcomes measured on independent assessments, whereas structured feedback anchored in students' own work appears less associated with reduced learning.
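The regression analysis described above can be sketched as a linear model of the following form; the variable names and the linear specification are illustrative assumptions for exposition, not the study's exact model:

```latex
\text{Midterm}_i = \beta_0
  + \beta_1\,\text{ChatbotUse}_i
  + \beta_2\,\text{AnswerSeeking}_i
  + \beta_3\,\text{ProofReviewUse}_i
  + \beta_4\,\text{PriorExam}_i
  + \beta_5\,\text{SelfEfficacy}_i
  + \varepsilon_i
```

Under this sketch, the reported pattern corresponds to negative estimates for $\beta_1$ and $\beta_2$ after conditioning on prior performance and self-efficacy, with $\beta_3$ not detectably different from zero.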