This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets that combines a masked autoencoder, unsupervised embedding mapping, and back-translation. Experimental results on speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, with an improvement of $18.14$ BLEU points on the synthesized Unpaired-Conversational dataset. Unlike supervised approaches, which require real paired data or specialized modeling to replicate para-/non-linguistic information such as pauses, speaking rates, and speaker identity, Translatotron 3 is shown to retain this information. Audio samples can be found at http://google-research.github.io/lingvo-lab/translatotron3