In this paper, we propose a method for learning unified representations of multilingual speech and text with a single model, with a particular focus on speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded by a self-supervised speech model. Treating the audio as pseudo text in this way lets us focus on its linguistic content and build a unified representation of speech and text. We then propose to train an encoder-decoder model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder on the source language token and the decoder on the target language token, the model is optimized to translate the source spoken language into the target spoken language in a many-to-many translation setting. The model can thereby learn how spoken languages are comprehended and how they relate to one another. A single model pre-trained with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). Through comprehensive experiments covering various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, we show that UTUT can perform many-to-many language STS, which has not been previously explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.
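To make the two core ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes that speech units are obtained by assigning each feature frame to its nearest entry in a codebook (for example, one learned with k-means), and that language conditioning is done by prepending a language tag to each sequence. The abstract specifies neither the quantizer nor the token format, so the function names `quantize_to_units` and `build_utut_example`, the `<en>`/`<es>` tags, and all numeric values are hypothetical.

```python
from typing import Dict, List, Sequence


def quantize_to_units(frames: Sequence[Sequence[float]],
                      centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    Assumption: nearest-centroid assignment stands in for the quantizer.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def build_utut_example(src_units: List[int], tgt_units: List[int],
                       src_lang: str, tgt_lang: str) -> Dict[str, List[str]]:
    """Attach language tags: source tag for the encoder, target tag for the decoder.

    Assumption: the "<lang>" tag format is illustrative only.
    """
    encoder_input = [f"<{src_lang}>"] + [str(u) for u in src_units]
    decoder_input = [f"<{tgt_lang}>"] + [str(u) for u in tgt_units]
    return {"encoder_input": encoder_input, "decoder_input": decoder_input}


# Toy 2-D "features" and a 3-entry codebook (made-up values).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
src_frames = [(0.1, -0.1), (0.9, 1.2), (4.8, 5.1)]
tgt_frames = [(5.2, 4.9), (0.2, 0.1)]

src_units = quantize_to_units(src_frames, centroids)
tgt_units = quantize_to_units(tgt_frames, centroids)
example = build_utut_example(src_units, tgt_units, "en", "es")
print(example)
```

In a real system the frames would come from a self-supervised speech model and the resulting pairs would be used to train an encoder-decoder model. The sketch only shows how audio becomes a pseudo-text sequence and where the language tokens enter.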