Language model (LM) distillation is a trending area that aims to distill the knowledge residing in a large teacher LM into a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily yield a better student than a smaller teacher does. In other words, there is likely an optimal teacher scale that yields the best student along the scaling course of the teacher. Even worse, previous studies indicate that the curse of capacity gap cannot be lifted without additional compute. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, since it is impossible to distill a large teacher into a good student without notable additional compute. However, the tale is not one-sided: it is never too late to recognize that using a large teacher is itself resource-demanding. Consequently, instead of striving to lift the curse, leaving the curse as is and using a small yet adequate teacher is arguably a sensible choice. Even better, in this paper, we take the spirit of scaling laws and reveal that the optimal teacher scale is almost consistently and linearly correlated with the student scale across different model architectures and data scales, fortunately turning the curse into a \textit{law} of capacity gap. The law then guides us to distill a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
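As an illustrative sketch only (not the paper's exact fitted law), the claimed linear correlation can be written as a relation between the optimal teacher scale $N_{t}^{\ast}$ and the student scale $N_{s}$, where the slope $k$ and intercept $b$ are hypothetical placeholders to be fitted per model architecture and data scale:
\begin{equation}
  N_{t}^{\ast} \;\approx\; k \cdot N_{s} + b .
\end{equation}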