Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated with it. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.