Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that, during the early epochs, RKL focuses on the tail part of the distributions, while FKL focuses on the head part. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
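The head/tail intuition above can be made concrete with a minimal sketch: compute FKL and RKL between a teacher and a student distribution, then blend them with a weight driven by where the teacher-student gap concentrates. The gap-based weighting shown here is an illustrative assumption, not the paper's exact AKL formula.

```python
import numpy as np

def kl(p, q):
    # forward KL: sum_i p_i * log(p_i / q_i); kl(q, p) gives reverse KL
    return float(np.sum(p * np.log(p / q)))

def akl(p, q, eps=1e-12):
    """Adaptive KL sketch: weight FKL vs. RKL by whether the
    teacher-student gap sits in the head (high teacher probability)
    or the tail of the teacher distribution.
    NOTE: this simple gap-based weight is a hypothetical stand-in
    for the adaptive allocation described in the abstract."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    fkl = kl(p, q)  # emphasizes the head early in training
    rkl = kl(q, p)  # emphasizes the tail early in training
    # split classes into head (top half by teacher probability) and tail
    threshold = np.sort(p)[::-1][max(1, len(p) // 2) - 1]
    head = p >= threshold
    gap = np.abs(p - q)
    g_head, g_tail = gap[head].sum(), gap[~head].sum()
    w = g_head / (g_head + g_tail + eps)  # more head error -> more FKL
    return w * fkl + (1 - w) * rkl

p = np.array([0.6, 0.3, 0.05, 0.05])  # teacher distribution
q = np.array([0.4, 0.3, 0.2, 0.1])    # student distribution
print(akl(p, q))
```

In this sketch the combined loss is non-negative and vanishes when student and teacher agree; in practice the same blending would be applied per token over vocabulary-sized softmax outputs.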