We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to build a more compact Speech-to-Text Translation model. In contrast to using the SSS model directly for initialization, our method is better suited to memory-constrained scenarios such as on-device deployment. It is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data, respectively; the DSU thus act as the distillation inputs for the smaller models. Subsequently, the encoder of the Fbk-to-DSU model and the decoder of the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on paired Fbk-Trl data. Beyond being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization at inference and is more robust to the choice of DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method yields consistent improvements over the baseline on three metrics while remaining compact, i.e., only half the size of the SSS model.
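The three-stage pipeline above can be outlined in a minimal sketch. All class and function names here are hypothetical stand-ins, not the paper's actual implementation; real training code would of course involve neural models and optimization.

```python
# Illustrative sketch of the three-stage pipeline (hypothetical names only).

class EncoderDecoder:
    """Minimal stand-in for a small encoder-decoder model."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

def train(model, data):
    # Placeholder for a training loop; returns the (nominally trained) model.
    return model

# Step 1: pretrain two smaller models on DSU-mediated tasks; the DSU
# extracted from the SSS model serve as the distillation interface.
fbk_to_dsu = train(EncoderDecoder("fbk-encoder", "dsu-decoder"),
                   data="Fbk-to-DSU pairs")
dsu_to_trl = train(EncoderDecoder("dsu-encoder", "trl-decoder"),
                   data="DSU-to-Trl pairs")

# Step 2: initialise the compact model from the Fbk-to-DSU encoder
# and the DSU-to-Trl decoder.
compact = EncoderDecoder(fbk_to_dsu.encoder, dsu_to_trl.decoder)

# Step 3: finetune end-to-end on paired Fbk-Trl data, so no speech
# discretization is needed at inference time.
compact = train(compact, data="Fbk-Trl pairs")
```

Note that the compact model never consumes DSU at inference: discretization is used only as a pretraining scaffold, which is why the approach needs no transcripts and is robust to the DSU tokenization.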