Textless speech-to-speech translation systems are advancing rapidly, thanks to the integration of self-supervised learning techniques. However, existing state-of-the-art systems fall short in capturing and transferring expressivity accurately across languages. Expressivity plays a vital role in conveying emotions, nuances, and cultural subtleties, and thereby enhances cross-lingual communication. To address this issue, this study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings to capture language-agnostic information. Specifically, we demonstrate how these embeddings can be used to predict the pitch and duration of speech units in the target language. Objective and subjective experiments on a French-to-English translation task show that our approach transfers expressivity better than current state-of-the-art systems.
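To make the idea of predicting pitch and duration from emotion embeddings concrete, the following is a minimal sketch, not the method of this study. It assumes that each discrete speech unit has a learned embedding, that an utterance-level emotion embedding is concatenated to it, and that a single linear head outputs a log-F0 offset and a log-duration. The class name `ProsodyPredictor`, the dimensions, the 150 Hz base pitch, and the frame scaling are all hypothetical, and random weights stand in for trained parameters.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ProsodyPredictor:
    """Hypothetical predictor: unit embedding + emotion embedding -> (F0, duration)."""

    def __init__(self, num_units, unit_dim, emotion_dim, seed=0):
        rng = random.Random(seed)
        # Random values stand in for a trained unit embedding table (assumption).
        self.unit_table = [
            [rng.uniform(-1.0, 1.0) for _ in range(unit_dim)]
            for _ in range(num_units)
        ]
        # One linear head with two outputs: log-F0 offset and log-duration.
        self.head = [
            [rng.uniform(-0.1, 0.1) for _ in range(unit_dim + emotion_dim)]
            for _ in range(2)
        ]

    def predict(self, units, emotion_embedding):
        f0_hz, durations = [], []
        for unit in units:
            # Condition each target-language unit on the emotion embedding.
            features = self.unit_table[unit] + list(emotion_embedding)
            log_f0_offset, log_duration = linear(self.head, features)
            # 150 Hz base pitch and a 3-frame scale are arbitrary assumptions.
            f0_hz.append(150.0 * math.exp(log_f0_offset))
            durations.append(max(1, round(3 * math.exp(log_duration))))
        return f0_hz, durations


predictor = ProsodyPredictor(num_units=100, unit_dim=8, emotion_dim=4)
units = [12, 12, 57, 3, 99]          # discrete units of the translated speech
emotion = [0.9, -0.2, 0.1, 0.4]      # emotion embedding from the source speech
f0_hz, durations = predictor.predict(units, emotion)
```

In this sketch the emotion embedding is the only source-side signal that reaches the target units, which mirrors the abstract's point that the embedding carries language-agnostic information. A real system would train the predictor and feed its outputs to a unit-based vocoder.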