Interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation, has been rising. However, most end-to-end models struggle to outperform cascade models, i.e., pipeline frameworks that chain speech recognition, machine translation, and text-to-speech models. The primary challenges stem from the inherent complexity of direct translation and the scarcity of data. In this study, we introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet enables end-to-end inference through joint probability. Furthermore, we propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during translation, making the model well suited to scenarios such as video dubbing. Experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.