Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. Existing studies have largely focused on mining hard negatives to improve discrimination, while the systematic composition of the training data and the resulting teacher score distribution have received relatively little attention. In this work, we show that focusing solely on hard negatives prevents the student from learning the teacher's full preference structure, which can hamper generalization. To emulate the teacher score distribution effectively, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline and significantly outperforms top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.
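The contrast between top-K, random, and stratified selection can be illustrated with a minimal sketch. The abstract does not specify how the score spectrum is partitioned, so this example assumes equal-width bins over the teacher scores of one query's candidate pool, with one candidate drawn per bin. The function names (`top_k_sample`, `random_sample`, `stratified_sample`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def top_k_sample(scores, k):
    """Return indices of the k highest teacher scores (hard negatives only)."""
    scores = np.asarray(scores, dtype=float)
    return np.argsort(scores)[::-1][:k]


def random_sample(scores, k, rng):
    """Return k candidate indices drawn uniformly without replacement."""
    return rng.choice(len(scores), size=k, replace=False)


def stratified_sample(scores, k, rng):
    """Return k indices that cover the score range, one per equal-width bin.

    Assumption: the score spectrum is split into k equal-width bins between
    the minimum and maximum teacher score. Empty bins are backfilled with
    random unused candidates so that exactly k indices are returned.
    """
    scores = np.asarray(scores, dtype=float)
    if not 2 <= k <= len(scores):
        raise ValueError("k must satisfy 2 <= k <= number of candidates")

    edges = np.linspace(scores.min(), scores.max(), k + 1)
    # Using only the interior edges maps every score to a bin id in [0, k - 1],
    # with the maximum score landing in the last bin.
    bin_ids = np.digitize(scores, edges[1:-1])

    chosen = []
    for b in range(k):
        members = np.flatnonzero(bin_ids == b)
        if members.size > 0:
            chosen.append(int(rng.choice(members)))

    if len(chosen) < k:
        remaining = np.setdiff1d(np.arange(len(scores)), chosen)
        extra = rng.choice(remaining, size=k - len(chosen), replace=False)
        chosen.extend(int(i) for i in extra)

    return np.array(chosen)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic teacher scores for one query; most candidates score low.
    teacher_scores = np.linspace(0.0, 1.0, 101) ** 3

    for name, idx in [
        ("top-K", top_k_sample(teacher_scores, 8)),
        ("random", random_sample(teacher_scores, 8, rng)),
        ("stratified", stratified_sample(teacher_scores, 8, rng)),
    ]:
        picked = teacher_scores[idx]
        print(f"{name:>10}: variance of selected scores = {picked.var():.4f}")
```

On a skewed pool like the synthetic one above, top-K selection concentrates on a narrow band of high scores, whereas the stratified selection spans the whole range, which is the property the abstract associates with preserving the variance of teacher scores. Other binning choices (for example quantile-based bins) are equally plausible readings and would change which candidates are picked.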