Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, perform remarkably well across a variety of speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression into a more compact model so that they can be used more widely in academia and at small companies. In this study, we propose reusing attention maps across Transformer layers, which removes the key and query parameters of the reusing layers while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the quality of the student model's speech representations. We extend the distillation loss to use both masked and unmasked speech frames, so as to fully leverage the teacher model's high-quality representations. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.
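The two ideas in the abstract, attention map reuse and a distillation loss over both masked and unmasked frames, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head with no residual connections, feed-forward blocks, or normalization, and it uses mean squared error as a stand-in for whatever distance the paper actually uses. All names (`run_layers`, `distill_loss`, `unmasked_weight`, and so on) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    """Small random matrix, used only as toy weights and toy input."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def attention_map(x, w_q, w_k):
    """Scaled dot-product attention map softmax(Q K^T / sqrt(d))."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(w_q[0]))
    return softmax_rows([[s / scale for s in row] for row in matmul(q, k_t)])


def run_layers(x, layers):
    """Toy layer stack (assumption: single head, no residual or FFN).

    A layer that has "w_q" and "w_k" computes a fresh attention map.
    A layer without them reuses the most recent map, so it only needs "w_v".
    """
    attn = None
    for layer in layers:
        if "w_q" in layer:
            attn = attention_map(x, layer["w_q"], layer["w_k"])
        if attn is None:
            raise ValueError("the first layer must compute its own attention map")
        x = matmul(attn, matmul(x, layer["w_v"]))
    return x, attn


def count_params(layers):
    """Total number of weights across all layers."""
    return sum(len(w) * len(w[0]) for layer in layers for w in layer.values())


def distill_loss(student, teacher, masked, unmasked_weight=1.0):
    """Frame-wise MSE over masked frames plus weighted MSE over unmasked frames.

    MSE and unmasked_weight are illustrative assumptions, not the paper's loss.
    """
    def frame_mse(s, t):
        return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    pairs = list(zip(student, teacher, masked))
    masked_part = mean([frame_mse(s, t) for s, t, is_masked in pairs if is_masked])
    unmasked_part = mean([frame_mse(s, t) for s, t, is_masked in pairs if not is_masked])
    return masked_part + unmasked_weight * unmasked_part


rng = random.Random(0)
dim, frames = 4, 3
x = rand_matrix(frames, dim, rng)
layers = [
    {"w_q": rand_matrix(dim, dim, rng), "w_k": rand_matrix(dim, dim, rng),
     "w_v": rand_matrix(dim, dim, rng)},
    {"w_v": rand_matrix(dim, dim, rng)},  # reuses the first layer's attention map
]
student_out, shared_attn = run_layers(x, layers)
```

In this toy stack the second layer stores only a value projection, so its parameter count drops from three `dim × dim` matrices to one while the depth is unchanged, which is the trade-off the abstract describes. The loss keeps the masked and unmasked terms separate so that their relative weight can be tuned.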