We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it on word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation, bootstrapped from the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets and find that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU for German to English and 8.07 for English to German.
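As a rough illustration of step (b), word-by-word pseudo-translations can be produced by looking up, for each source word, its nearest neighbour among target words in a shared embedding space. The sketch below is not the authors' implementation: it assumes the two monolingual embedding spaces have already been aligned, uses hypothetical toy 2-d vectors, and picks cosine similarity as the retrieval criterion. It covers only the text side; converting the resulting pairs into discrete units is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_by_word_translate(sentence, src_emb, tgt_emb):
    """Replace each source word with its nearest target word in the shared space."""
    output = []
    for word in sentence.split():
        if word not in src_emb:
            output.append(word)  # copy out-of-vocabulary words unchanged
            continue
        best = max(tgt_emb, key=lambda t: cosine(src_emb[word], tgt_emb[t]))
        output.append(best)
    return " ".join(output)

# Hypothetical toy embeddings, assumed to be already aligned across languages.
SRC_EMB = {"guten": [0.9, 0.1], "morgen": [0.1, 0.9]}
TGT_EMB = {"good": [1.0, 0.0], "morning": [0.0, 1.0]}

# A (source, pseudo-translation) pair of the kind used to train the initial model.
pseudo_pair = ("guten morgen", word_by_word_translate("guten morgen", SRC_EMB, TGT_EMB))
```

Such pairs are noisy, since word order and morphology are ignored, which is consistent with using them only to initialise the translation model before backtranslation in step (c).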