Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of the training data and the resulting teacher score distribution have received relatively little attention. In this work, we show that focusing solely on hard negatives prevents the student from learning the teacher's full preference structure, which can hamper generalization. To emulate the teacher score distribution effectively, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline and significantly outperforms top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores assigned by the teacher.
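To make the contrast between the three sampling schemes concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes that the candidates for a query have already been scored by the teacher, and that "uniformly covering the score spectrum" can be approximated by ranking candidates by teacher score, splitting the ranking into k contiguous strata, and drawing one candidate from each stratum. The function names, the rank-based definition of strata, and the synthetic scores are all illustrative assumptions.

```python
import numpy as np


def top_k_sample(scores, k):
    """Return indices of the k highest teacher scores (hard negatives only)."""
    order = np.argsort(scores)[::-1]  # candidate indices, best score first
    return order[:k]


def random_sample(scores, k, rng):
    """Return k candidate indices drawn uniformly without replacement."""
    return rng.choice(len(scores), size=k, replace=False)


def stratified_sample(scores, k, rng):
    """Return one candidate index from each of k rank-based score strata.

    Assumption: strata are contiguous slices of the score-sorted candidate
    list, which is only one possible way to cover the score spectrum.
    """
    if k > len(scores):
        raise ValueError("k must not exceed the number of candidates")
    order = np.argsort(scores)[::-1]
    strata = np.array_split(order, k)  # k non-empty slices, high to low
    return np.array([rng.choice(stratum) for stratum in strata])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_scores = rng.normal(size=100)  # synthetic stand-in for teacher scores
    samples = {
        "top-k": top_k_sample(teacher_scores, 8),
        "random": random_sample(teacher_scores, 8, rng),
        "stratified": stratified_sample(teacher_scores, 8, rng),
    }
    for name, idx in samples.items():
        # Variance of the sampled teacher scores, one of the statistics
        # the abstract says stratified sampling preserves.
        print(name, float(teacher_scores[idx].var()))
```

Under this reading, top-K draws only from the upper end of the ranking, whereas the stratified draw always contains candidates from the high, middle, and low score regions, so the sampled scores span a wider range.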