Direct speech-to-speech translation (S2ST) has steadily gained popularity because it offers many advantages over cascaded S2ST. However, current research focuses mainly on the accuracy of semantic translation and ignores the transfer of speech style from the source language to the target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in the more practical zero-shot scenarios. To solve this problem, we first build a parallel corpus using a multilingual, multi-speaker text-to-speech (TTS) synthesis system, and then propose StyleS2ST, a model that adds cross-lingual speech style transfer to a direct S2ST framework by means of a style adaptor. By enabling continuous style space modeling in the acoustic model through parallel corpus training and non-parallel TTS data augmentation, StyleS2ST captures the cross-lingual acoustic feature mapping from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.