Large language models have become a vital component of modern NLP, achieving state-of-the-art performance on a variety of tasks. However, their high inference cost often makes them impractical for real-world deployment. Knowledge distillation is a promising technique for improving their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare, and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models. The methods we study include Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation, and we explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind the other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.
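To make two of the compared objectives concrete, the following is a minimal sketch of OD transfer and HS transfer. It is not the implementation used in this paper. It assumes that OD transfer minimizes the KL divergence between teacher and student output distributions, that HS transfer minimizes the mean squared error between hidden states of mapped layers, and that teacher and student share the same hidden width (otherwise a learned projection would be needed). The function names, the `temperature` parameter, and the uniform layer mapping are illustrative assumptions; the uniform mapping is only one possible layer mapping strategy.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def od_loss(teacher_logits, student_logits, temperature=1.0):
    # Assumed OD objective: KL(teacher || student) for one token's
    # distribution over the vocabulary.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def uniform_layer_map(num_teacher_layers, num_student_layers):
    # Hypothetical uniform strategy: student layer s (1-indexed) is
    # matched to teacher layer s * (T // S).
    step = num_teacher_layers // num_student_layers
    return {s: s * step for s in range(1, num_student_layers + 1)}

def hs_loss(teacher_hidden, student_hidden, layer_map):
    # Assumed HS objective: mean squared error between hidden vectors
    # of mapped layers. Both arguments map layer index -> list of floats
    # of the same length.
    total, count = 0.0, 0
    for s_layer, t_layer in layer_map.items():
        for t, s in zip(teacher_hidden[t_layer], student_hidden[s_layer]):
            total += (t - s) ** 2
            count += 1
    return total / count
```

For example, `uniform_layer_map(12, 4)` matches the four student layers to teacher layers 3, 6, 9, and 12. MHA transfer in the style of MiniLMv2 instead matches self-attention relation distributions between teacher and student and is not sketched here.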