Direct speech-to-speech translation (S2ST) has gradually become popular because it has many advantages over cascaded S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the transfer of speech style from the source language to the target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in the more practical zero-shot scenarios. To solve this problem, we first build a parallel corpus using a multi-lingual multi-speaker text-to-speech synthesis (TTS) system, and then propose StyleS2ST, a model with cross-lingual speech style transfer ability built on a style adaptor within a direct S2ST framework. By enabling the acoustic model to learn a continuous style space through parallel corpus training and non-parallel TTS data augmentation, StyleS2ST captures the cross-lingual acoustic feature mapping from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.