Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine the instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.