Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale. As a result, many recent works on LLM-powered tutoring have relied on simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure, or even measure, the quality of these simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics spanning linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield substantially better but still limited performance, motivating future work on this challenging task.