Using Self-Supervised Learning (SSL) models for initialization is now common practice for obtaining strong results in Speech Translation (ST). However, these models also impose a large memory footprint, hindering on-device deployment. In this paper, we leverage SSL models by pretraining smaller models on their Discrete Speech Units (DSU). We pretrain encoder-decoder models on 1) Filterbank-to-DSU and 2) DSU-to-Translation data, then take the encoder from 1) and the decoder from 2) to initialise a new model, which we finetune on limited speech-translation data. The final model is compact because the DSU pretraining distils the knowledge of the SSL model into it. Our method has several benefits over using DSU as model inputs, such as a shorter inference pipeline and robustness to (DSU) tokenization. In contrast to ASR pretraining, it does not require transcripts, making it applicable to low-resource settings. Evaluation on CoVoST-2 X-En shows that our method is $>0.5$ BLEU better than an ST model that directly finetunes the SSL model, at only half the model size, and on a par with ASR pretraining.
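The two-stage initialization scheme can be sketched as follows. This is a minimal illustration in PyTorch, assuming a hypothetical `build_model` helper and generic Transformer modules; the actual architecture and training loops are not specified in the abstract.

```python
import torch.nn as nn

def build_model(d_model: int = 512, nhead: int = 8,
                enc_layers: int = 12, dec_layers: int = 6):
    """Hypothetical helper returning a fresh encoder-decoder pair;
    all sizes are illustrative, not taken from the paper."""
    enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
    encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
    return encoder, decoder

# Stage 1: pretrain an encoder-decoder on Filterbank -> DSU pairs.
# The DSU targets are obtained by discretizing the SSL model's features,
# so this stage distils the SSL model's knowledge into the small encoder.
fbank_encoder, dsu_decoder = build_model()
# ... train on (filterbank, DSU) data ...

# Stage 2: pretrain a second encoder-decoder on DSU -> Translation pairs.
# No transcripts are needed, unlike ASR pretraining.
dsu_encoder, translation_decoder = build_model()
# ... train on (DSU, translation) data ...

# Final ST model: encoder from stage 1, decoder from stage 2.
# At inference, it maps filterbanks directly to translations, so no
# DSU tokenization step is needed in the pipeline.
st_encoder, st_decoder = fbank_encoder, translation_decoder
# ... finetune end-to-end on limited (filterbank, translation) data ...
```

Note that the final model consumes filterbank features directly; the DSU appear only as pretraining targets, which is what gives the shorter inference pipeline and the robustness to DSU tokenization mentioned above.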