Maximizing Discrimination Capability of Knowledge Distillation with Energy Function

From arXiv, 33 pages, 7 figures. This work has been submitted to Elsevier for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Knowledge distillation (KD) methods are essential for deploying the latest computer vision techniques, which require a large computational cost, in real industrial applications. Existing logit-based KD methods apply a constant temperature scaling to all samples in a dataset, which limits the use of the knowledge inherent in each individual sample. In our approach, we classify the dataset into two categories (i.e., low energy and high energy samples) based on their energy scores. Through experiments, we have confirmed that low energy samples exhibit high confidence scores, indicating certain predictions, while high energy samples yield low confidence scores, indicating uncertain predictions. To distill optimal knowledge by adjusting non-target class predictions, we apply a higher temperature to low energy samples to create smoother distributions and a lower temperature to high energy samples to achieve sharper distributions. Compared to previous logit-based and feature-based methods, our energy-based KD (Energy KD) achieves better performance on various datasets. In particular, Energy KD shows significant improvements on the CIFAR-100-LT and ImageNet datasets, which contain many challenging samples. Furthermore, we propose high energy-based data augmentation (HE-DA) to further improve performance. We demonstrate that a meaningful performance improvement can be achieved by augmenting only 20-50% of the dataset, suggesting that the method can be employed on resource-limited devices. To the best of our knowledge, this paper represents the first attempt to make use of an energy function in knowledge distillation and data augmentation, and we believe it will greatly contribute to future research.
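The per-sample temperature idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the energy formula, the thresholds, or the temperature values, so everything concrete here is an assumption: the energy is taken to be the common negative log-sum-exp of the teacher logits, samples are split by energy rank, and the names `energy_score`, `assign_temperatures`, `kd_loss`, `high_energy_indices`, and the parameters `base_T`, `delta`, and `frac` are hypothetical. It is not the authors' implementation.

```python
import math

def energy_score(logits, T=1.0):
    """Assumed energy: E(x) = -T * log(sum_i exp(z_i / T)), computed stably."""
    m = max(logits)
    return -T * (m / T + math.log(sum(math.exp((z - m) / T) for z in logits)))

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_temperatures(teacher_logits, base_T=4.0, delta=1.0, frac=0.25):
    """Hypothetical rule: the lowest-energy `frac` of samples get base_T + delta
    (smoother targets), the highest-energy `frac` get base_T - delta (sharper
    targets), and all remaining samples keep base_T."""
    energies = [energy_score(z) for z in teacher_logits]
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    k = int(len(order) * frac)
    temps = [base_T] * len(order)
    if k:
        for i in order[:k]:
            temps[i] = base_T + delta
        for i in order[-k:]:
            temps[i] = base_T - delta
    return energies, temps

def kd_loss(student_logits, teacher_logits, T):
    """Standard logit-based KD term for one sample: T^2 * KL(teacher || student)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def high_energy_indices(energies, frac=0.2):
    """Indices of the highest-energy `frac` of samples, e.g. as candidates
    for augmentation in the spirit of HE-DA."""
    k = int(len(energies) * frac)
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return ranked[:k]

# Toy teacher logits: sample 0 is very confident, sample 1 is nearly uniform.
teacher = [[10.0, 0.0, 0.0], [1.0, 0.8, 0.9], [5.0, 1.0, 0.0], [2.0, 1.5, 1.0]]
student = [[6.0, 1.0, 0.5], [0.5, 1.0, 0.7], [3.0, 1.5, 0.2], [1.0, 1.2, 0.9]]

energies, temps = assign_temperatures(teacher)
losses = [kd_loss(s, t, T) for s, t, T in zip(student, teacher, temps)]
print(temps)
print(high_energy_indices(energies, frac=0.5))
```

In this toy batch the confident sample receives the higher temperature and the near-uniform sample the lower one, matching the direction described in the abstract. Ranking by energy rather than using fixed thresholds is only a convenience for the sketch.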
