Knowledge distillation (KD) and mixup have both proven effective at smoothing class boundaries: KD captures inherent class relationships in probability distributions, while mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite this, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks and that goes beyond dark-knowledge transfer. Across CIFAR and ImageNet, with teachers of varying capacity, KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
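To make the setting concrete, the following minimal sketch shows a distillation loss in which mixup is applied only on the student side, so the teacher is queried on a mixed input it was never trained on. It is an illustration under stated assumptions, not the method's actual implementation: the toy linear models, the function names, the default temperature of 4.0, and the T² scaling of the KL term (a common KD convention) are all assumptions introduced here, and the cross-entropy term on mixed labels is omitted.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x_a, x_b, lam):
    """Convex combination of two inputs with coefficient lam in [0, 1]."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def linear_logits(weights, x):
    """Hypothetical stand-in for a network: one weight row per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2 (assumed)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_t, p_s)
        if pt > 0.0
    )
    return temperature ** 2 * kl


def mixup_kd_loss(teacher_w, student_w, x_a, x_b, lam, temperature=4.0):
    """Distillation loss where both models see the same mixed input.

    The teacher is assumed to have been trained on clean inputs only,
    so x_mix lies outside its training distribution.
    """
    x_mix = mixup(x_a, x_b, lam)
    teacher_logits = linear_logits(teacher_w, x_mix)
    student_logits = linear_logits(student_w, x_mix)
    return kd_loss(student_logits, teacher_logits, temperature)


if __name__ == "__main__":
    random.seed(0)
    teacher_w = [[2.0, -1.0], [-1.0, 2.0]]  # illustrative weights
    student_w = [[1.0, 0.0], [0.0, 1.0]]
    lam = random.betavariate(1.0, 1.0)  # mixup draws lam from Beta(alpha, alpha)
    loss = mixup_kd_loss(teacher_w, student_w, [1.0, 0.0], [0.0, 1.0], lam)
    print(f"lam={lam:.3f}  kd_loss={loss:.4f}")
```

Raising `temperature` flattens both softened distributions, which is the knob through which an accuracy-calibration trade-off of the kind described above would be explored.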