This paper proposes a direct text-to-speech translation system based on discrete acoustic units. The framework takes text in different source languages as input and generates speech in the target language without requiring text transcriptions in that language. Motivated by the success of acoustic units in previous work on direct speech-to-speech translation, we use the same pipeline to extract the acoustic units, combining a speech encoder with a clustering algorithm. Once the units are obtained, an encoder-decoder architecture is trained to predict them from the source text, and a vocoder then generates speech from the predicted units. Our approach was tested on the new CVSS corpus, using two different text mBART models for initialisation. The proposed systems achieve competitive performance for most of the language pairs evaluated. Moreover, results show a remarkable improvement when our architecture is initialised with a model pre-trained on more languages.
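The unit extraction step described above can be illustrated with a minimal sketch. The abstract does not name the speech encoder or the clustering algorithm, so the code assumes that frame-level features are already available and that cluster centroids have already been fitted (for example with k-means). Each frame is mapped to the index of its nearest centroid, and consecutive repeated indices are collapsed. Collapsing repeats is an assumption borrowed from common practice in unit-based pipelines, not something stated here. The function names and toy values are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]],
                    collapse_repeats: bool = True) -> List[int]:
    """Map frame-level features to discrete unit indices.

    Assumption: `centroids` were fitted beforehand by some clustering
    algorithm; the source does not specify which one.
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if collapse_repeats:
        # Assumption: merge runs of identical units, e.g. [0, 0, 1] becomes [0, 1].
        units = [unit for unit, _ in groupby(units)]
    return units


# Toy 2-D "features" standing in for speech encoder outputs (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (1.1, 0.8), (3.8, 0.1), (0.0, 0.1)]

units = frames_to_units(frames, centroids)
print(units)  # [0, 1, 2, 0]
```

In a pipeline of this kind, the resulting unit sequence would serve as the training target for the text-to-unit encoder-decoder, and the vocoder would be trained to synthesise speech from such sequences.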