We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). Our framework consists of two language models: a translation language model and a speech synthesis language model. Because we use discretized speech units that are generated in a fully unsupervised way, the framework can also be applied to unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This enables our framework to preserve the voice characteristics and speaking style of the original speech. We evaluate our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system generates speech with high translation quality and high audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.