Aligning large language models (LLMs) as math tutors typically demands costly reinforcement-learning (RL) training and external LLM judges. We ask whether a frozen model's internal reasoning signals can replace both. We propose the Tutoring Effectiveness Index (TEI), a training-free, judge-free index that combines four signals: a Schoenfeld-Verify keyword ratio, a math-step density, an ends-question rate, and a deep-reasoning gate from the Deep-Thinking Ratio (DTR) probe. Selecting the best of $N$ candidates by TEI (the TEI@$N$ rule) raises the improvement rate on pre-incorrect scenarios from $59.0\%$ to $81.9\%$ at $N{=}8$ on a frozen DeepSeek-R1-8B base, with no training and no external judge. We also measure the alignment tax of pedagogical GRPO: thinking length drops from $1{,}764$ to $119$ words per turn ($-93\%$), Content-Knowledge and Pedagogical-Knowledge accuracy fall by $71\%$ and $80\%$ in relative terms, and the student's $\Delta$ Solve Rate falls from $+0.180$ to $-0.012$, turning negative. To anchor the behavioural reading, we reproduce an 82-code educational codebook on $119{,}009$ tutor sentences with a one-shot structural classifier. Together, these results offer a cost-effective recipe for building math-tutoring LLMs without RL training or external judges.
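The TEI@$N$ selection rule can be illustrated with a minimal sketch. The abstract names the four signals but not how they are computed or combined, so everything specific here is an assumption made for illustration: the keyword list `VERIFY_KEYWORDS`, the regex heuristics, the equal-weight average, the binary gate with `dtr_threshold`, and the `Candidate` type with a precomputed `dtr` value are all hypothetical and may differ from the actual method.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the Schoenfeld-Verify terms are not given in the abstract.
VERIFY_KEYWORDS = {"check", "verify", "confirm", "double-check"}


@dataclass
class Candidate:
    text: str   # tutor response text
    dtr: float  # Deep-Thinking Ratio, assumed to be precomputed by the probe


def tei_score(c: Candidate, dtr_threshold: float = 0.5) -> float:
    """Illustrative TEI: gate * mean of three surface signals (assumed form)."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", c.text.lower())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", c.text.strip()) if s]
    if not words or not sentences:
        return 0.0
    # Signal 1: share of words that are verification keywords.
    verify_ratio = sum(w in VERIFY_KEYWORDS for w in words) / len(words)
    # Signal 2: share of sentences containing a digit and an arithmetic symbol.
    step_density = sum(
        bool(re.search(r"\d", s) and re.search(r"[=+*/]", s)) for s in sentences
    ) / len(sentences)
    # Signal 3: whether the response ends with a question (per-response indicator).
    ends_question = 1.0 if c.text.rstrip().endswith("?") else 0.0
    # Signal 4: binary deep-reasoning gate on the DTR value (assumed threshold).
    gate = 1.0 if c.dtr >= dtr_threshold else 0.0
    return gate * (verify_ratio + step_density + ends_question) / 3.0


def select_tei_at_n(candidates: list[Candidate]) -> Candidate:
    """TEI@N: return the highest-scoring of the N candidates."""
    return max(candidates, key=tei_score)
```

Under these assumptions, no training and no external judge are involved: the score depends only on the candidate's text and the probe value, and a candidate that fails the gate scores zero regardless of its surface signals.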