In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous evaluation efforts have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusion in the mathematical domain. We release MRBench -- a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, with gold annotations for the eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 LLM as an evaluator, analyze each tutor's pedagogical abilities, and highlight which LLMs are good tutors and which are better suited as question-answering systems. We believe the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track progress in the development of AI tutors.