Recent studies have found that language model distillation is less effective when there is a large capacity gap between the teacher and the student, and have introduced teacher assistant-based distillation to bridge the gap. As the link between the two, the teacher assistant's scale and performance are vital for carrying knowledge from the teacher to the student. However, existing teacher assistant-based methods can require a maximal number of trials before an optimal teacher assistant is scheduled. To this end, we propose a minimal distillation schedule (MiniDisc) that schedules the optimal teacher assistant in as little as one trial. In particular, motivated by the finding that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant, MiniDisc is designed with a $\lambda$-tradeoff that measures the optimality of the teacher assistant without trial distillation to the student. MiniDisc can then schedule the optimal teacher assistant, the one with the best $\lambda$-tradeoff, in a sandwich framework. MiniDisc is evaluated with an extensive set of experiments on GLUE. Experimental results demonstrate the improved efficiency of MiniDisc compared to several state-of-the-art baselines. We further apply MiniDisc to a language model with billions of parameters and show its scalability.
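To make the selection step concrete, here is a minimal sketch of choosing a teacher assistant by a scale-performance score without distilling to the student. The abstract does not give the formula for the $\lambda$-tradeoff, so the scoring rule below (performance plus $\lambda$ times the fraction of parameters removed) is an assumption for illustration only, as are the names `Candidate`, `lambda_tradeoff`, and `pick_assistant` and the numbers in the example. Candidates are assumed to lie between the teacher and the student in scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical teacher-assistant candidate."""
    name: str
    scale: float        # fraction of the teacher's parameters kept, in (0, 1]
    performance: float  # the candidate's own task score, in [0, 1]


def lambda_tradeoff(c: Candidate, lam: float) -> float:
    # Assumed scoring rule (not taken from the paper): reward the
    # candidate's performance and, weighted by lam, its compactness.
    return c.performance + lam * (1.0 - c.scale)


def pick_assistant(candidates: list[Candidate], lam: float) -> Candidate:
    # Choose the candidate with the highest score. No trial
    # distillation to the student is run during selection.
    return max(candidates, key=lambda c: lambda_tradeoff(c, lam))


# Illustrative, made-up candidates.
candidates = [
    Candidate("ta_80", scale=0.8, performance=0.90),
    Candidate("ta_50", scale=0.5, performance=0.88),
    Candidate("ta_20", scale=0.2, performance=0.70),
]

best = pick_assistant(candidates, lam=0.2)
print(best.name)
```

With `lam=0.0` the rule reduces to picking the best-performing candidate regardless of size. Larger values of `lam` shift the choice toward smaller assistants.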