This paper describes the results of the first shared task on the generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab. They experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 using an ensemble of prompts and a DialogRPT-based ranking of responses for given dialogue contexts. Despite the promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.