Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

From arXiv, 27 pages. Project page: https://github.com/UmeanNever/RankSurprisalRatio

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
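
The definition in the abstract (average token-wise rank divided by average negative log-likelihood) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it makes several assumptions: the function name `rank_surprisal_ratio` is hypothetical, `logits` stands for the student model's next-token logits at each position of the trajectory, a token's rank is taken as 1 plus the number of vocabulary entries with strictly higher probability, and no clipping or normalization is applied. The project page should be consulted for the exact formulation.

```python
import numpy as np


def rank_surprisal_ratio(logits, token_ids):
    """Illustrative RSR: mean token rank / mean negative log-likelihood.

    logits:    array of shape (T, V), student logits for each of T positions
               over a vocabulary of size V (assumed input format).
    token_ids: array of shape (T,), the trajectory's actual tokens.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Log-probability the student assigns to each trajectory token.
    positions = np.arange(len(token_ids))
    target_logp = log_probs[positions, token_ids]

    # Assumed rank convention: 1 = most likely token under the student.
    ranks = 1 + (log_probs > target_logp[:, None]).sum(axis=-1)

    avg_rank = ranks.mean()
    avg_surprisal = (-target_logp).mean()  # average negative log-likelihood
    # Note: undefined if every token has probability 1 (zero surprisal).
    return avg_rank / avg_surprisal


# Toy usage: two positions, vocabulary of three tokens.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.5, 0.25, 0.25]])
print(rank_surprisal_ratio(np.log(probs), [0, 1]))
```

In the toy call, the first token has rank 1 and the second has rank 2, so the numerator is 1.5, while the denominator is the mean of ln 2 and 2 ln 2. With a real student model, the logits would come from a forward pass over the teacher's trajectory.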
