Existing speech-to-speech translation (S2ST) models fall into two camps: textless models trained on hundreds of hours of parallel speech data, or unsupervised models that leverage text as an intermediate step. Both approaches limit S2ST to a narrow set of languages, as they exclude languages that are primarily spoken and language pairs that lack large-scale parallel speech data. We present a new framework for training textless low-resource S2ST systems that need only dozens of hours of parallel speech data. We reformulate S2ST as a unit-to-unit sequence-to-sequence translation task and begin by pretraining a model on large-scale monolingual speech data. We then finetune it on a small amount of parallel speech data (20–60 hours), and finally improve performance with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English, and Marathi-to-English translation in three domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data. Evaluated with the ASR-BLEU metric, our models achieve reasonable performance in all three domains, with some coming within 1–2 points of our supervised topline.
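The backtranslation stage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes speech has already been discretized into unit ID sequences (e.g., cluster indices from self-supervised speech features), and `toy_reverse_model` is a hypothetical stand-in for a trained target-to-source unit translator used to generate synthetic parallel pairs.

```python
# Sketch of the backtranslation data-generation step for unit-to-unit S2ST.
# Assumption: speech is represented as sequences of discrete unit IDs.
from typing import Callable, List, Tuple

Units = List[int]

def backtranslate(
    reverse_model: Callable[[Units], Units],
    mono_target: List[Units],
) -> List[Tuple[Units, Units]]:
    """Turn monolingual target-side unit sequences into synthetic
    (source, target) training pairs for the forward translation model."""
    pairs = []
    for tgt_units in mono_target:
        # Translate target units back into the source language; the
        # real target sequence then serves as the training label.
        synthetic_src = reverse_model(tgt_units)
        pairs.append((synthetic_src, tgt_units))
    return pairs

# Hypothetical stand-in for a trained reverse model: simply reverses the
# sequence, purely to make the data flow concrete.
def toy_reverse_model(units: Units) -> Units:
    return units[::-1]

mono = [[3, 1, 4], [1, 5, 9, 2]]  # monolingual target-side unit sequences
pairs = backtranslate(toy_reverse_model, mono)
```

The forward model would then be finetuned on `pairs` alongside the real parallel data, which is what allows the unsupervised objective to improve performance beyond the 20–60 hours of supervision.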