Designing dialog tutors is challenging because it requires modeling the diverse and complex pedagogical strategies employed by human tutors. Despite significant recent advances in neural conversational systems built on large language models (LLMs) and the growth in available dialog corpora, dialog tutoring has remained largely unaffected by these advances. In this paper, we rigorously analyze various generative language models on two dialog tutoring datasets for language learning, using automatic and human evaluations, to understand the new opportunities these advances bring as well as the challenges we must overcome to build models that are usable in real educational settings. We find that although current approaches can model tutoring in constrained learning scenarios, where the number of concepts to be taught and the number of possible teacher strategies are small, they perform poorly in less constrained scenarios. Our human quality evaluation shows that both models and ground-truth annotations perform poorly in terms of equitable tutoring, which measures learning opportunities for students and how engaging the dialog is. To understand the behavior of our models in a real tutoring setting, we conduct a user study with expert annotators and find a large number of model reasoning errors in 45% of conversations. Finally, we connect our findings to outline future work.