A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. To examine whether this assumption holds, we introduce an evaluation pipeline built around two metrics, Chatbot Scaffolding and Student Uptake, and apply it to nine datasets totalling 9,490 chats, spanning AI tutor benchmarks and real-world deployments of educational chatbots. Our analysis reveals that while benchmarks assume a high-scaffolding, high-uptake environment, students in real-world settings exhibit lower uptake overall, frequently bypassing the chatbot's pedagogical framing to drive the interaction toward their own learning goals at little interpersonal cost. We argue that bypassing scaffolding is not necessarily detrimental; rather, it often signals a mismatch between a chatbot's pedagogical framing and the student's learning goals. To meaningfully evaluate the effectiveness of a chatbot's assistance, future benchmarks must move beyond the assumption that students will simply take up the scaffolding, and instead evaluate how chatbots navigate diverse learning contexts and student-driven interaction patterns.