Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, focusing on preserving both the semantics and the speaker's vocal style in translated speech. Early work synthesized speaker-style-aligned speech in order to directly learn the mapping from source speech to target speech spectrograms. Without relying on style-aligned data, recent studies leverage advances in language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translation, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer while also achieving better parameter efficiency.
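The two-step decomposition described in the abstract (semantic content first, then speaker style in acoustic units, both produced by one model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the marker tokens `<semantic>` and `<acoustic>`, the function names, and the toy generator are hypothetical and are not taken from the paper. The sketch also treats acoustic units as a flat list and does not model the multi-stream structure.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical marker tokens separating the segments of one sequence.
SEMANTIC_START = "<semantic>"
ACOUSTIC_START = "<acoustic>"

Token = str
# A "model" here is any callable that maps a prompt to a continuation.
Generator = Callable[[Sequence[Token]], List[Token]]


def chain_of_thought_decode(
    source_tokens: Sequence[Token],
    generate: Generator,
) -> Dict[str, List[Token]]:
    """Two-step decoding with one model: semantics first, then acoustics."""
    # Step 1: prompt with the source speech and request target semantic content.
    prompt = list(source_tokens) + [SEMANTIC_START]
    semantic = generate(prompt)

    # Step 2: extend the same sequence, so the acoustic step is conditioned on
    # both the source speech (style) and the generated semantics (content).
    prompt = prompt + semantic + [ACOUSTIC_START]
    acoustic = generate(prompt)

    return {"semantic": semantic, "acoustic": acoustic, "final_prompt": prompt}


def toy_generate(prompt: Sequence[Token]) -> List[Token]:
    """Stand-in for a trained speech LM; returns fixed tokens per step."""
    if prompt[-1] == SEMANTIC_START:
        return ["sem_1", "sem_2"]
    return ["ac_1", "ac_2", "ac_3"]


result = chain_of_thought_decode(["src_1", "src_2"], toy_generate)
```

The point of the sketch is that a single generator is called twice on a growing sequence, in contrast to a cascade in which separate models handle the semantic and acoustic stages.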