Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge to learning style transfer during translation. We design an S2ST pipeline with style-transfer capability built on discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style-transfer ability without relying on any speaker-parallel data and thereby overcoming data scarcity. Trained on extensive data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .
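To make the in-context formulation concrete, here is a minimal sketch, not the paper's implementation, of how an acoustic language model could learn style transfer without speaker-parallel data: a decoder-only model over a joint vocabulary of semantic and codec units, prompted with the source speaker's own semantic/codec unit pair from the same utterance. The class name `AcousticLM`, the vocabulary sizes, the model dimensions, and the plain Transformer backbone with a causal mask are all illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class AcousticLM(nn.Module):
    """Hypothetical decoder-only acoustic LM: semantic units -> codec units,
    with the source speaker's own (semantic, codec) pair as an in-context
    prompt. All sizes and names below are assumptions for illustration."""

    def __init__(self, n_semantic=1000, n_codec=1024, d_model=512,
                 n_layers=6, max_len=4096):
        super().__init__()
        # Joint token space: semantic units first, codec units offset after.
        self.codec_offset = n_semantic
        self.embed = nn.Embedding(n_semantic + n_codec, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_codec)

    def forward(self, prompt_sem, prompt_codec, target_sem, target_codec_prefix):
        # The prompt pairs the source speaker's semantic units with their own
        # codec units; both come from the same utterance, so training needs no
        # speaker-parallel corpora. The model then continues the sequence with
        # codec units for the translated semantic units, carrying over the
        # timbre conveyed by the prompt.
        seq = torch.cat([prompt_sem,
                         prompt_codec + self.codec_offset,
                         target_sem,
                         target_codec_prefix + self.codec_offset], dim=1)
        positions = torch.arange(seq.size(1), device=seq.device)
        h = self.embed(seq) + self.pos(positions)
        # Standard causal mask so each position attends only to its past.
        L = seq.size(1)
        mask = torch.triu(torch.full((L, L), float('-inf'), device=seq.device),
                          diagonal=1)
        h = self.backbone(h, mask=mask)
        return self.head(h[:, -1])  # logits over the next codec unit
```

Under these assumptions, inference would sample one codec unit at a time, append it to `target_codec_prefix`, and finally decode the accumulated codec units to a waveform with the codec's decoder; at training time the prompt and the continuation are simply two segments of the same utterance, which is what lets in-context style transfer be learned without any speaker-parallel data.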