Kullback-Leibler (KL) divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Moreover, we find that in the early epochs RKL focuses on the tail of the distributions, whereas FKL focuses on the head. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
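To make the distinction concrete, the sketch below shows token-level FKL and RKL between teacher and student distributions, plus a hypothetical adaptive combination. The abstract only states that AKL adaptively weights FKL and RKL based on their head/tail behavior; the specific weighting rule in `adaptive_kl` (splitting the vocabulary by the teacher's mean probability and weighting by the head/tail gap) is an illustrative assumption, not the paper's actual AKL formulation.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    # FKL(p || q): expectation taken under the teacher distribution p.
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kl(teacher_logits, student_logits):
    # RKL(q || p): expectation taken under the student distribution q.
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

def adaptive_kl(teacher_logits, student_logits):
    # Hypothetical adaptive weighting (assumption, for illustration only):
    # FKL emphasizes the head of the teacher distribution and RKL the tail,
    # so weight each term by how large the current student-teacher gap is
    # on that part of the vocabulary.
    with torch.no_grad():
        p = F.softmax(teacher_logits, dim=-1)
        q = F.softmax(student_logits, dim=-1)
        head_mask = (p >= p.mean(dim=-1, keepdim=True)).float()
        gap = (p - q).abs()
        head_gap = (gap * head_mask).sum(dim=-1)
        tail_gap = (gap * (1.0 - head_mask)).sum(dim=-1)
        w_head = (head_gap / (head_gap + tail_gap + 1e-8)).mean()
    return w_head * forward_kl(teacher_logits, student_logits) + \
        (1.0 - w_head) * reverse_kl(teacher_logits, student_logits)
```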