For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models' identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher's predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.
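The abstract contrasts KL-based distillation with distillation under a logit-based distance. The paper's exact definition of the logit distance is not reproduced here; the following is a minimal NumPy sketch under the assumption that the distance compares per-example mean-centered logits (since the softmax is invariant to adding a constant to all logits of an example), and it illustrates the abstract's claim that KL divergence can be small while the logit distance remains large when probabilities are close to zero.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_loss(teacher_logits, student_logits):
    # KL(p_teacher || p_student), averaged over examples.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def logit_distance(teacher_logits, student_logits):
    # Hypothetical logit distance: Euclidean norm of the difference of
    # per-example mean-centered logits, averaged over examples.
    # (Mean-centering removes the additive degree of freedom that the
    # softmax ignores.)
    t = teacher_logits - teacher_logits.mean(axis=-1, keepdims=True)
    s = student_logits - student_logits.mean(axis=-1, keepdims=True)
    return float(np.mean(np.linalg.norm(t - s, axis=-1)))

# Teacher and student that assign near-identical probabilities but very
# different logits to a low-probability class: KL is tiny, logit distance
# is not.
teacher = np.array([[0.0, -20.0]])
student = np.array([[0.0, -40.0]])
print(kl_loss(teacher, student))        # tiny: both put ~all mass on class 0
print(logit_distance(teacher, student)) # large: logits disagree on class 1
```

Both quantities are zero when the two models' logits agree up to a per-example constant, matching the identifiability setting; they diverge from each other exactly in the regime the abstract highlights, where some probabilities approach zero.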