Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
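The diagnostic described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the correlation is taken to be Pearson's r, the gap is taken to be solving minus pedagogy, and rank shift is taken to be solving rank minus pedagogy rank. The model names and scores are hypothetical placeholders, not MathTutorBench results, and the function names `pearson`, `ranks`, and `diagnose` are introduced only for illustration.

```python
from statistics import mean

# Hypothetical placeholder composites, NOT real leaderboard numbers.
scores = {
    "model_a": {"solving": 0.80, "pedagogy": 0.55},
    "model_b": {"solving": 0.70, "pedagogy": 0.65},
    "model_c": {"solving": 0.60, "pedagogy": 0.40},
    "model_d": {"solving": 0.50, "pedagogy": 0.60},
}


def pearson(xs, ys):
    """Pearson correlation (assumed; the abstract does not name the statistic)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Rank 1 is the highest score; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def diagnose(table):
    """Return the solving/pedagogy correlation and per-model gap and rank shift."""
    names = list(table)
    solving = [table[n]["solving"] for n in names]
    pedagogy = [table[n]["pedagogy"] for n in names]
    solving_rank = ranks(solving)
    pedagogy_rank = ranks(pedagogy)
    rows = {}
    for i, name in enumerate(names):
        rows[name] = {
            # Positive gap: the model solves better than it teaches.
            "gap": solving[i] - pedagogy[i],
            # Positive shift: the model moves up when ranked by pedagogy.
            "rank_shift": solving_rank[i] - pedagogy_rank[i],
        }
    return pearson(solving, pedagogy), rows


if __name__ == "__main__":
    r, rows = diagnose(scores)
    print(f"correlation = {r:.3f}")
    for name, row in rows.items():
        print(f"{name}: gap = {row['gap']:+.2f}, rank shift = {row['rank_shift']:+d}")
```

Reporting the gap and the rank shift per model, rather than a single blended score, keeps solving-oriented and pedagogy-oriented performance visible as separate quantities, which is the reporting practice the abstract argues for.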