Maximizing Discrimination Capability of Knowledge Distillation with Energy-based Score

From arXiv, 22 pages, 4 figures. This work has been submitted to Elsevier for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Knowledge distillation methods (KDs) are essential for deploying the latest computer vision techniques, which carry a large computational cost, in real industrial applications. Existing logit-based KDs apply a constant temperature scaling to all samples in a dataset, which limits how well the knowledge inherent in each individual sample can be used. In our approach, we divide the dataset into two categories (i.e., low energy and high energy samples) based on their energy score. Through experiments, we have confirmed that low energy samples exhibit high confidence scores, indicating certain predictions, while high energy samples yield low confidence scores, indicating uncertain predictions. To distill optimal knowledge by adjusting non-target class predictions, we apply a higher temperature to low energy samples to create smoother distributions and a lower temperature to high energy samples to achieve sharper distributions. Compared with previous logit-based and feature-based methods, our energy-based KD (Energy KD) achieves better performance on various datasets. In particular, Energy KD shows significant improvements on the CIFAR-100-LT and ImageNet datasets, which contain many challenging samples. Furthermore, we propose high energy-based data augmentation (HE-DA) to further improve performance. We demonstrate that a meaningful performance improvement can be achieved by augmenting only 20-50% of the dataset, suggesting that HE-DA can be employed on resource-limited devices. To the best of our knowledge, this paper represents the first attempt to make use of energy scores in KD and DA, and we believe it will greatly contribute to future research.
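The abstract does not define the energy score or the exact temperature rule, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the commonly used free-energy form E(x) = -T * logsumexp(logits / T), computes energy from the teacher's logits, splits samples at a single quantile threshold, and shifts a base temperature up or down by a fixed offset. The names `per_sample_temperature`, `energy_kd_loss`, `high_energy_indices` and the parameters `base_T`, `delta`, `q`, `fraction` are hypothetical and chosen for illustration.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Assumed definition: E(x) = -T * logsumexp(logits / T), one value per row."""
    z = np.asarray(logits, dtype=np.float64) / T
    m = z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    return -T * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def log_softmax(logits, T):
    """Row-wise log-softmax with a per-sample temperature T of shape (N,)."""
    z = np.asarray(logits, dtype=np.float64) / T[:, None]
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def per_sample_temperature(teacher_logits, base_T=4.0, delta=1.0, q=0.5):
    """Low-energy (confident) samples get base_T + delta, high-energy ones base_T - delta.

    The quantile threshold q and the offset delta are illustrative assumptions.
    """
    energy = energy_score(teacher_logits)
    threshold = np.quantile(energy, q)
    T = np.where(energy <= threshold, base_T + delta, base_T - delta)
    return T, energy

def energy_kd_loss(student_logits, teacher_logits, **kwargs):
    """KL(teacher || student) with per-sample temperatures, weighted by T^2.

    The T^2 factor follows the usual logit-based KD convention (an assumption here).
    """
    T, _ = per_sample_temperature(teacher_logits, **kwargs)
    log_p_t = log_softmax(teacher_logits, T)
    log_p_s = log_softmax(student_logits, T)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=1)
    return float(np.mean(T ** 2 * kl))

def high_energy_indices(energy, fraction=0.3):
    """Indices of the highest-energy samples, e.g. as candidates for augmentation."""
    k = int(round(fraction * len(energy)))
    return np.argsort(energy)[::-1][:k]
```

In this sketch, a confident teacher prediction such as logits `[10, 0, 0]` has a lower energy than a near-uniform one such as `[1, 0.9, 1.1]`, so the former is softened with the higher temperature and the latter is sharpened with the lower one. `high_energy_indices` illustrates how a 20-50% subset might be selected for augmentation, assuming HE-DA targets the highest-energy samples.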
