This paper presents CrossVoice, a novel cascade-based Speech-to-Speech Translation (S2ST) system that combines advanced automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) technologies with cross-lingual prosody preservation through transfer learning. We conducted comprehensive experiments comparing CrossVoice with direct S2ST systems; the results show improved BLEU scores on tasks such as Fisher Es-En and VoxPopuli Fr-En, and improved prosody preservation on the benchmark datasets CVSS-T and IndicTTS. With a mean opinion score averaging 3.75 out of 4, speech synthesized by CrossVoice closely rivals human speech on the benchmark, highlighting the efficacy of cascade-based systems and transfer learning in multilingual S2ST with prosody transfer.