Numerous attempts have been made to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. Despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on downstream tasks. We set the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher across sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. The pipeline includes an additional merging stage, in which individually linearized experts are combined into a single model. We demonstrate its effectiveness by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
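To make the notion of a tolerance-corrected Win-and-Tie rate concrete, the following is a minimal sketch of one plausible reading, not the definition used in this work. It assumes that higher scores are better, that a student "wins or ties" on a task when its score is no more than a fixed tolerance below the teacher's, and that the rate is the fraction of shared tasks where this holds. The function name `win_and_tie_rate`, the `tolerance` parameter, and the task names and scores are all hypothetical.

```python
from typing import Mapping


def win_and_tie_rate(
    student: Mapping[str, float],
    teacher: Mapping[str, float],
    tolerance: float = 0.01,
) -> float:
    """Fraction of shared tasks on which the student wins or ties.

    Assumption (not taken from the source): a task counts as a win or tie
    when student_score >= teacher_score - tolerance, with higher = better.
    """
    shared_tasks = student.keys() & teacher.keys()
    if not shared_tasks:
        raise ValueError("student and teacher share no tasks")
    hits = sum(
        1 for task in shared_tasks
        if student[task] >= teacher[task] - tolerance
    )
    return hits / len(shared_tasks)


# Hypothetical scores, for illustration only.
student_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.40}
teacher_scores = {"task_a": 0.68, "task_b": 0.56, "task_c": 0.50}

rate_with_tolerance = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.02)
rate_strict = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.0)
```

Under this reading, a student would be called lossless on a task set when the rate reaches 1.0; setting the tolerance to zero reduces the measure to a plain win-or-tie count.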