Speech-to-speech translation (S2ST) enables spoken communication between people who speak different languages. The few existing studies on multilingual S2ST focus on multilinguality on the source side, i.e., translation from multiple source languages into one target language. We present the first work on multilingual S2ST supporting multiple target languages. Leveraging recent advances in direct S2ST with speech-to-unit (S2U) models and vocoders, we equip these key components with multilingual capability. Speech-to-masked-unit (S2MU) is the multilingual extension of S2U; it masks units that do not belong to the given target language in order to reduce language interference. We also propose a multilingual vocoder, which is trained with language embeddings and an auxiliary language identification loss. On benchmark translation test sets, our proposed multilingual model outperforms bilingual models in translation from English into $16$ target languages.
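The unit-masking idea behind S2MU can be illustrated with a minimal sketch. It assumes that masking is applied to the output logits over a shared unit vocabulary and that each target language has a known set of valid unit indices. The abstract does not specify where or how the mask is applied, so the function names, the toy unit inventories in `LANG_UNITS`, and the logit values are all hypothetical.

```python
import math

def mask_unit_logits(logits, allowed_units):
    """Set the logits of units outside the target language's set to -inf."""
    if not allowed_units:
        raise ValueError("the target language must have at least one unit")
    return [x if i in allowed_units else -math.inf
            for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: 6 shared units and two per-language unit sets.
LANG_UNITS = {"es": {0, 1, 2}, "fr": {2, 3, 4, 5}}

# Hypothetical decoder scores for one output step.
logits = [2.0, 0.5, 1.0, 3.0, 0.1, -1.0]

probs_es = softmax(mask_unit_logits(logits, LANG_UNITS["es"]))
probs_fr = softmax(mask_unit_logits(logits, LANG_UNITS["fr"]))
```

In this toy case, unit 3 has the highest raw score but lies outside the hypothetical Spanish unit set, so it receives zero probability when the target is Spanish, and the remaining probability mass is renormalized over the allowed units only.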