Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

In this paper, we propose a method for learning unified representations of multilingual speech and text with a single model, with a particular focus on speech synthesis. We represent multilingual speech audio with speech units, which are quantized representations of speech features encoded by a self-supervised speech model. Treating the audio as pseudo text in this way lets us focus on its linguistic content and build a unified representation of speech and text. We then train an encoder-decoder model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder on a source language token and the decoder on a target language token, the model is optimized to translate speech in the source language into speech in the target language, in a many-to-many translation setting. The model thereby learns how spoken languages are comprehended and how they relate to one another. A single model pre-trained with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). Through comprehensive experiments covering various languages, we validate the efficacy of the proposed method across these multilingual tasks. Moreover, we show that UTUT can perform many-to-many STS, which has not been previously explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.
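
As a rough illustration of the ideas in the abstract, the following minimal sketch shows how frame-level speech features might be quantized into discrete units and paired with language tokens for unit-to-unit translation. It is not the authors' implementation. The function names, the `<en>`/`<es>` token format, the `u{index}` unit naming, and the toy centroids are all hypothetical. Collapsing consecutive repeated units is a step often used in unit-based pipelines, but the abstract does not state it, so it is included here only as an assumption.

```python
from typing import Dict, List, Sequence


def quantize(features: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame feature to the index of its nearest centroid.

    The index plays the role of a discrete "speech unit". In practice the
    features would come from a self-supervised speech model and the
    centroids from clustering; both are toy values in this sketch.
    """
    units = []
    for frame in features:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units (assumed step, see lead-in)."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


def make_utut_example(src_units: Sequence[int], tgt_units: Sequence[int],
                      src_lang: str, tgt_lang: str) -> Dict[str, List[str]]:
    """Build one training pair: the encoder side is tagged with the source
    language token and the decoder side with the target language token.
    The token spelling is hypothetical."""
    return {
        "encoder_input": [f"<{src_lang}>"] + [f"u{u}" for u in src_units],
        "decoder_target": [f"<{tgt_lang}>"] + [f"u{u}" for u in tgt_units],
    }


# Toy data: four 2-dimensional frames and two centroids.
features = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 1.1]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

frame_units = quantize(features, centroids)
source_units = collapse_repeats(frame_units)
example = make_utut_example(source_units, [1], "en", "es")
```

Because both sides of the pair are unit sequences tagged only by a language token, the same data format covers any source and target language, which is what makes the many-to-many setting possible with one model.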

