Unit-based Speech-to-Speech Translation Without Parallel Data

We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it on word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation, bootstrapped from the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU for German-to-English and 8.07 for English-to-German.
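To make step (b) more concrete, the following is a minimal sketch of word-by-word translation through aligned embedding spaces. It is an illustration under stated assumptions, not the authors' implementation: the toy vectors in `SRC_EMB` and `TGT_EMB`, the function names, and the choice of cosine nearest-neighbour lookup are all hypothetical, and the two embedding spaces are assumed to be already aligned. How the resulting text pairs are converted into unit sequences is not described in the abstract and is left out of the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_by_word_translate(sentence, src_emb, tgt_emb):
    """Replace each source word with its nearest target word.

    Assumes src_emb and tgt_emb live in one shared (already aligned) space.
    Words without a source embedding are copied through unchanged.
    """
    out = []
    for word in sentence.split():
        if word not in src_emb:
            out.append(word)
            continue
        best = max(tgt_emb, key=lambda t: cosine(src_emb[word], tgt_emb[t]))
        out.append(best)
    return " ".join(out)

# Hypothetical toy embeddings, invented for illustration only.
SRC_EMB = {"das": [1.0, 0.0, 0.1], "haus": [0.0, 1.0, 0.1], "ist": [0.1, 0.1, 1.0]}
TGT_EMB = {"the": [0.9, 0.1, 0.1], "house": [0.1, 0.9, 0.0], "is": [0.0, 0.2, 0.9]}

source = "das haus ist"
pseudo_pair = (source, word_by_word_translate(source, SRC_EMB, TGT_EMB))
print(pseudo_pair)
```

Pairs produced this way ignore word order and context, which is consistent with the abstract treating them only as a starting point that backtranslation in step (c) then improves on.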

