Language model (LM) distillation is a trending area that aims to distill the knowledge residing in a large teacher LM into a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily yield a better student than one distilled from a smaller teacher. In other words, there is likely an optimal teacher that yields the best student along the scaling course of the teacher. However, as indicated in previous studies, the curse of capacity gap cannot be tackled without notable compute overhead. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as distilling an expected student from an optimal teacher with small compute overhead forms an impossible triangle. Fortunately, the impossible triangle can be made possible provided an inducted \textit{law} of capacity gap. In this paper, we take the spirit of scaling laws and reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law then guides us to distill a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
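To make the claimed linear law concrete, a minimal sketch of its functional form is given below. The coefficients $a$ and $b$ are hypothetical placeholders for fitted constants that the abstract does not specify; note only that the reported 7B-teacher/3B-student pair is consistent with a teacher-to-student ratio of roughly $7/3 \approx 2.3$.
\begin{equation*}
    N_{\mathrm{teacher}}^{\ast} \approx a \cdot N_{\mathrm{student}} + b,
\end{equation*}
where $N_{\mathrm{student}}$ denotes the student parameter scale, $N_{\mathrm{teacher}}^{\ast}$ denotes the teacher scale yielding the best student, and $a$, $b$ are constants presumed to be fitted empirically across model architectures and data scales.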