Using Self-Supervised Learning (SSL) models for initialization is now common practice for obtaining strong results in Speech Translation (ST). However, these models also impose a large memory footprint, hindering on-device deployment. In this paper, we leverage SSL models by pretraining smaller models on their Discrete Speech Units (DSU). We pretrain encoder-decoder models on 1) Filterbank-to-DSU and 2) DSU-to-Translation data, then take the encoder from 1) and the decoder from 2) to initialise a new model, which we finetune on limited speech-translation data. The final model is compact because the DSU pretraining distils the knowledge of the SSL model into it. Our method has several benefits over using DSU as model inputs, such as a shorter inference pipeline and robustness to (DSU) tokenization. In contrast to ASR pretraining, it does not require transcripts, making it applicable to low-resource settings. Evaluation on CoVoST-2 X-En shows that our method is $>0.5$ BLEU better than an ST model that directly finetunes the SSL model, at only half the model size, and on a par with ASR pretraining.
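The two-stage initialization scheme can be sketched as follows. This is a minimal illustration in PyTorch, assuming a hypothetical `build_model` helper and generic Transformer modules; the actual architecture and training loops are not specified in the abstract.

```python
import torch.nn as nn

def build_model(d_model: int = 512, nhead: int = 8,
                enc_layers: int = 12, dec_layers: int = 6):
    """Hypothetical helper returning a fresh encoder-decoder pair;
    all sizes are illustrative, not taken from the paper."""
    enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
    encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
    return encoder, decoder

# Stage 1: pretrain an encoder-decoder on Filterbank -> DSU pairs.
# The DSU targets are obtained by discretizing the SSL model's features,
# so this stage distils the SSL model's knowledge into the small encoder.
fbank_encoder, dsu_decoder = build_model()
# ... train on (filterbank, DSU) data ...

# Stage 2: pretrain a second encoder-decoder on DSU -> Translation pairs.
# No transcripts are needed, unlike ASR pretraining.
dsu_encoder, translation_decoder = build_model()
# ... train on (DSU, translation) data ...

# Final ST model: encoder from stage 1, decoder from stage 2.
# At inference, it maps filterbanks directly to translations, so no
# DSU tokenization step is needed in the pipeline.
st_encoder, st_decoder = fbank_encoder, translation_decoder
# ... finetune end-to-end on limited (filterbank, translation) data ...
```

Note that the final model consumes filterbank features directly; the DSU appear only as pretraining targets, which is what gives the shorter inference pipeline and the robustness to DSU tokenization mentioned above.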