In this paper, we examine the Kullback-Leibler (KL) Divergence loss and observe that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels. Our analysis of the DKL loss identifies two areas for improvement. First, we address the limitation of DKL in scenarios such as knowledge distillation by breaking its asymmetric optimization property. This modification ensures that the wMSE component is always effective during training, providing extra constructive cues. Second, we introduce global information into DKL for intra-class consistency regularization. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness on the CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merit. Code and models will be available soon at https://github.com/jiequancui/DKL.
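The claimed equivalence between KL and DKL can be illustrated with a small numerical check. The sketch below is not the authors' implementation. The abstract does not state the decomposition, so the code assumes it takes the form of a wMSE term over pairwise logit differences, weighted by products of the first distribution's probabilities and scaled by 1/4, plus a cross-entropy term whose soft labels are the first distribution's probabilities. It also assumes that "equivalent" means equal gradients rather than equal loss values. Stop-gradient is emulated by passing frozen copies of the logits, and all function names (`kl_loss`, `dkl_loss`, `gradient_gap`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(o_m, o_n):
    """KL(softmax(o_m) || softmax(o_n)) computed from raw logits."""
    s_m, s_n = softmax(o_m), softmax(o_n)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(s_m, s_n))


def dkl_loss(o_m, o_n, o_m_frozen, o_n_frozen):
    """Assumed decoupled form: wMSE + cross-entropy with soft labels.

    The *_frozen arguments stand in for stop-gradient: they are held
    constant while o_m or o_n is perturbed.
    """
    s_m_frozen = softmax(o_m_frozen)  # weights and soft labels, no gradient
    s_n = softmax(o_n)
    c = len(o_m)

    # Weighted MSE over pairwise logit differences; only o_m is "live" here.
    wmse = 0.0
    for j in range(c):
        for k in range(c):
            delta_m = o_m[j] - o_m[k]
            delta_n = o_n_frozen[j] - o_n_frozen[k]
            wmse += s_m_frozen[j] * s_m_frozen[k] * (delta_m - delta_n) ** 2
    wmse *= 0.25

    # Cross-entropy with soft labels; only o_n is "live" here.
    ce = -sum(p * math.log(q) for p, q in zip(s_m_frozen, s_n))
    return wmse + ce


def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_gap(o_m, o_n):
    """Largest absolute difference between KL and DKL gradients."""
    g_kl = num_grad(lambda a: kl_loss(a, o_n), o_m) + \
        num_grad(lambda b: kl_loss(o_m, b), o_n)
    g_dkl = num_grad(lambda a: dkl_loss(a, o_n, o_m, o_n), o_m) + \
        num_grad(lambda b: dkl_loss(o_m, b, o_m, o_n), o_n)
    return max(abs(x - y) for x, y in zip(g_kl, g_dkl))


if __name__ == "__main__":
    # Arbitrary illustrative logits, not values from the paper.
    print(gradient_gap([2.0, 0.5, -1.0], [0.3, 1.2, -0.4]))
```

Under these assumptions, the wMSE term supplies the whole gradient with respect to the first set of logits and the cross-entropy term supplies the whole gradient with respect to the second set. In a setting such as knowledge distillation, where the first set comes from a fixed teacher, the wMSE term would therefore contribute nothing, which is consistent with the limitation the abstract describes.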