FerKD: Surgical Label Adaptation for Efficient Distillation

We present FerKD, an efficient knowledge distillation framework that combines partial soft-hard label adaptation with a region-calibration mechanism. The approach stems from the observation that standard data augmentations, such as RandomResizedCrop, turn inputs into samples of varying difficulty: easy positives, hard positives, or hard negatives. Traditional distillation frameworks use all of these transformed samples equally, through the predictive probabilities of a pretrained teacher model. However, relying solely on the teacher's predictions, a common practice in prior studies, neglects how reliable those soft labels are. To address this, we propose a scheme that calibrates the less-confident regions as context, using softened hard ground-truth labels. The approach consists of two steps: hard-region mining and calibration. We demonstrate empirically that this method substantially improves both convergence speed and final accuracy. Additionally, we find that a consistent mixing strategy can stabilize the distributions of soft supervision, making better use of the soft labels. We therefore introduce a stabilized SelfMix augmentation that reduces the variation of the mixed images and their soft labels by mixing similar regions within the same image. FerKD is an intuitive, well-designed learning system that eliminates several heuristics and hyperparameters of the earlier FKD solution. More importantly, it achieves notable improvements on ImageNet-1K and downstream tasks. For instance, FerKD achieves 81.2% on ImageNet-1K with ResNet-50, outperforming FKD and FunMatch by notable margins. With better pre-trained weights and larger architectures, our finetuned ViT-G14 reaches 89.9%. Our code is available at https://github.com/szq0214/FKD/tree/main/FerKD.
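
To make the calibration idea concrete, the following is a minimal sketch of one plausible reading of it, not the released FerKD implementation. It assumes that a crop counts as "less confident" when the teacher's maximum probability falls below a threshold, and that calibration means replacing that crop's soft label with a label-smoothed hard label. The function names, the threshold `low_conf`, and the smoothing value `eps` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def smooth_hard_label(class_idx, num_classes, eps=0.1):
    """Label-smoothed one-hot vector: 1 - eps on the true class,
    eps spread evenly over the remaining classes."""
    label = np.full(num_classes, eps / (num_classes - 1))
    label[class_idx] = 1.0 - eps
    return label


def calibrate_soft_labels(teacher_probs, class_idx, low_conf=0.3, eps=0.1):
    """Replace the teacher's soft label with a smoothed hard label
    for crops on which the teacher is not confident.

    teacher_probs: (num_crops, num_classes) teacher probabilities, one row per crop.
    class_idx:     ground-truth class of the source image.
    low_conf:      hypothetical threshold on the teacher's maximum probability.
    """
    teacher_probs = np.asarray(teacher_probs, dtype=float)
    num_classes = teacher_probs.shape[1]

    # Hard-region mining (assumed criterion): low maximum teacher probability.
    is_low = teacher_probs.max(axis=1) < low_conf

    # Calibration: overwrite only the mined rows, keep the rest as soft labels.
    calibrated = teacher_probs.copy()
    calibrated[is_low] = smooth_hard_label(class_idx, num_classes, eps)
    return calibrated, is_low


# Two crops of an image whose ground-truth class is 0.
probs = [[0.90, 0.05, 0.05],   # confident crop: soft label is kept
         [0.40, 0.35, 0.25]]   # uncertain crop: label is calibrated
calibrated, is_low = calibrate_soft_labels(probs, class_idx=0, low_conf=0.5)
```

In this sketch only the mined rows are changed, so confident crops still carry the teacher's full soft distribution. The actual criterion for mining hard regions and the exact form of the softened label should be taken from the paper and the linked code.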
