Large Language Models (LLMs) are known for their strong performance, but we uncover a significant structural inefficiency in them: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe for building smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model, then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact yet highly performant language models. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
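To make the notion of "near rank-one" attention concrete, here is a minimal illustrative sketch (not the authors' exact diagnostic): it compares a synthetic row-stochastic attention matrix built from random logits against a collapsed one in which every query attends to nearly the same distribution, using a simple singular-value threshold as a rank proxy. The `effective_rank` helper and the tolerance value are assumptions chosen for illustration.

```python
import numpy as np

def effective_rank(A, tol=1e-2):
    """Count singular values above tol * largest singular value.

    A crude rank proxy: a near rank-one matrix has one dominant
    singular value and the rest close to zero.
    """
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)

# A "healthy" attention matrix: row-wise softmax over random logits,
# so each row is a distinct attention distribution summing to 1.
logits = rng.normal(size=(8, 8))
healthy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# A "collapsed" attention matrix: every query attends to (almost) the
# same distribution, so the matrix is close to rank one.
base = np.exp(rng.normal(size=8))
base /= base.sum()
collapsed = np.tile(base, (8, 1)) + rng.normal(scale=1e-4, size=(8, 8))

print(effective_rank(healthy))    # near-full rank
print(effective_rank(collapsed))  # 1
```

In a real model, the same proxy could be applied to per-head attention maps layer by layer; the abstract's claim is that deeper layers increasingly resemble the collapsed case.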