Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we propose reusing attention maps across the Transformer layers, which removes the key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to use both masked and unmasked speech frames, so as to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.
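The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the class names `FullAttention` and `ReuseAttention`, the single-head attention, the function `masked_unmasked_distill_loss`, the use of mean squared error, and the `unmasked_weight` parameter. The sketch only shows that a layer which receives an attention map from an earlier layer needs no query or key projections, and that a distillation loss can sum a term over masked frames and a term over unmasked frames.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class FullAttention:
    """Standard single-head self-attention that also returns its attention map."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_o = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, x):
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return (attn @ v) @ self.w_o, attn

    def num_params(self):
        return sum(w.size for w in (self.w_q, self.w_k, self.w_v, self.w_o))


class ReuseAttention:
    """Hypothetical layer that reuses an earlier attention map (no query/key weights)."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_o = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, x, attn):
        # Only the value and output projections are computed in this layer.
        return (attn @ (x @ self.w_v)) @ self.w_o

    def num_params(self):
        return self.w_v.size + self.w_o.size


def masked_unmasked_distill_loss(student, teacher, mask, unmasked_weight=1.0):
    """Assumed form: MSE over masked frames plus weighted MSE over unmasked frames.

    `mask` is a boolean array over frames and must contain both True and False.
    """
    per_frame = ((student - teacher) ** 2).mean(axis=-1)
    return per_frame[mask].mean() + unmasked_weight * per_frame[~mask].mean()


rng = np.random.default_rng(0)
frames, dim = 6, 8
x = rng.normal(size=(frames, dim))

layer1 = FullAttention(dim, rng)
layer2 = ReuseAttention(dim, rng)

h1, attn = layer1(x)   # layer 1 computes the attention map
h2 = layer2(h1, attn)  # layer 2 reuses it

teacher = rng.normal(size=(frames, dim))  # stand-in for teacher representations
mask = np.array([True, False, True, False, False, True])
loss = masked_unmasked_distill_loss(h2, teacher, mask)
```

In this toy setting the reusing layer holds half the attention weights of the full layer, since two of the four projection matrices are gone. The loss function and weighting used in the actual work may differ from the MSE form assumed here.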