Towards the Law of Capacity Gap in Distilling Language Models

Language model (LM) distillation is an active area that aims to distil the knowledge residing in a large teacher LM into a small student one. While various methods have been proposed to push distillation to its limits, distilling LMs remains difficult when there is a large capacity gap between the teacher and the student. The difficulty mainly results from the curse of capacity gap: a larger teacher LM does not always lead to a better student LM than one distilled from a smaller teacher LM, because of the effect of the increased capacity gap. That is, there is likely an optimal point along the scaling course of the teacher LM that yields the best student LM. Even worse, previous studies indicate that the curse of capacity gap can be lifted only partly, not fully. However, the story is not entirely one-sided. Although a larger teacher LM performs better than a smaller one, it is also much more resource-demanding, especially in the context of recent large LMs (LLMs). Consequently, instead of insisting on lifting the curse, it is arguably fine to leave the curse as is. Even better, in this paper we reveal that the optimal capacity gap is almost consistent across different student scales and architectures, which fortunately turns the curse into the law of capacity gap. The law then guides us to distil a 3B student LM (termed MiniMA) from a 7B teacher LM (adapted LLaMA2-7B). MiniMA is demonstrated to yield a new compute-performance Pareto frontier among existing 3B LMs on commonly used benchmarks, and its instruction-tuned version (termed MiniChat) outperforms a wide range of 3B competitors in GPT4 evaluation and can even compete with several 7B chat models.
