Knowledge distillation has been shown to be a powerful model compression approach that facilitates the practical deployment of pre-trained language models. This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational cost and memory footprint. Despite these practical benefits, task-agnostic distillation is challenging. Because the teacher model has significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model and iteratively prune the student's neurons until the target width is reached. This approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines.
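To make the "initialize from the teacher, then prune iteratively while distilling" idea concrete, the following is a minimal sketch on a toy problem, not the HomoDistil implementation. Everything in it is an assumption made for illustration: the scalar-input two-layer ReLU network, the magnitude-based importance score `|w_j * v_j|`, the one-neuron-per-round pruning schedule, the squared-error distillation loss, and the function name `homotopic_distill_sketch` are all hypothetical. It only shows the overall loop: the student starts as an exact copy of the teacher, so the prediction discrepancy begins at zero, and a few distillation steps follow each small pruning step.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def forward(w, v, mask, x):
    # Toy scalar-input network: sum over hidden neurons of mask_j * v_j * relu(w_j * x).
    return sum(m * vj * relu(wj * x) for wj, vj, m in zip(w, v, mask))


def homotopic_distill_sketch(hidden=16, target_width=4, steps_per_prune=20,
                             lr=0.01, seed=0):
    """Return a list of (student_width, teacher-student MSE) after each pruning round."""
    rng = random.Random(seed)
    # Hypothetical "teacher": a random two-layer network with `hidden` neurons.
    teacher_w = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    teacher_v = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    data = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    targets = [forward(teacher_w, teacher_v, [1] * hidden, x) for x in data]

    # The student is initialized as an exact copy of the teacher.
    student_w, student_v = list(teacher_w), list(teacher_v)
    mask = [1] * hidden  # 1 = neuron kept, 0 = neuron pruned

    def discrepancy():
        # Mean squared gap between teacher and student predictions.
        return sum((forward(student_w, student_v, mask, x) - t) ** 2
                   for x, t in zip(data, targets)) / len(data)

    history = [(sum(mask), discrepancy())]
    while sum(mask) > target_width:
        # Prune the active neuron with the smallest (assumed) importance score.
        active = [j for j in range(hidden) if mask[j]]
        weakest = min(active, key=lambda j: abs(student_w[j] * student_v[j]))
        mask[weakest] = 0
        active.remove(weakest)

        # Distill: SGD on the squared error against the teacher's outputs.
        for _ in range(steps_per_prune):
            for x, t in zip(data, targets):
                err = forward(student_w, student_v, mask, x) - t
                for j in active:
                    pre = student_w[j] * x
                    grad_v = 2.0 * err * relu(pre)
                    grad_w = 2.0 * err * student_v[j] * (x if pre > 0.0 else 0.0)
                    student_v[j] -= lr * grad_v
                    student_w[j] -= lr * grad_w

        history.append((sum(mask), discrepancy()))
    return history


if __name__ == "__main__":
    for width, gap in homotopic_distill_sketch():
        print(f"width={width:2d}  teacher-student MSE={gap:.6f}")
```

Running the script prints the student width and the teacher-student discrepancy after each pruning round. In this toy setting the first entry has zero discrepancy by construction, because the unpruned student is identical to the teacher. How the gap evolves afterwards depends on the assumed importance score and step sizes.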