Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Because these large models are costly to develop, building new encoders for new tasks and deploying them in on-device applications is infeasible. Prior studies propose model compression methods to address this issue, but they focus on smaller models and less realistic tasks. We therefore propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method that compresses pre-trained speech encoders by using masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.
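To make the idea of combining masked prediction with contrastive learning concrete, the following is a minimal sketch of what a layer-to-layer contrastive objective could look like. The abstract does not specify the loss, so everything here is an assumption made for illustration: an InfoNCE-style form, cosine similarity, the temperature value, the choice of negatives (teacher frames at the other masked positions of the same utterance), and the premise that the student sees a masked input while the teacher sees the unmasked one. The names `cosine` and `contrastive_l2l_loss` are hypothetical and are not taken from the CoLLD implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_l2l_loss(student, teacher, masked, temperature=0.1):
    """InfoNCE-style loss for one student/teacher layer pair (assumed form).

    student, teacher: lists of T frame vectors with the same dimension.
    masked: indices of frames assumed to be masked in the student's input.
    For each masked frame t, teacher[t] is the positive and the teacher
    frames at the other masked positions act as negatives.
    """
    if not masked:
        return 0.0
    total = 0.0
    for t in masked:
        logits = [cosine(student[t], teacher[k]) / temperature for k in masked]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        positive = cosine(student[t], teacher[t]) / temperature
        total += log_z - positive
    return total / len(masked)


# Toy check: a student that matches the teacher frame by frame gets a
# much lower loss than one whose frames are aligned with the wrong targets.
teacher_frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotated_frames = teacher_frames[1:] + teacher_frames[:1]
aligned_loss = contrastive_l2l_loss(teacher_frames, teacher_frames, [0, 1, 2])
rotated_loss = contrastive_l2l_loss(rotated_frames, teacher_frames, [0, 1, 2])
```

In a full layer-to-layer setup, one would presumably average such a term over several paired student and teacher layers; which layers are paired and how the terms are weighted is not stated in the abstract and is left open here.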