We propose a method for emotion-preserving speech-to-speech translation that operates at the level of discrete speech units. Our approach relies on a multilingual emotion embedding that captures affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesize the source speech signal with the same emotional content. We evaluate our approach on English and French speech signals and show that it outperforms a baseline that does not use emotional information, including when the emotion embedding is extracted from a different language. Although this preliminary study does not directly address machine translation itself, our results demonstrate the effectiveness of our approach for cross-lingual emotion preservation in the context of speech resynthesis.