How to boost speech pre-training with textual data is an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
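As a rough illustration of the shared-unit idea only, the toy sketch below maps both text and speech frames into one phoneme-unit vocabulary. It is not the SpeechLM tokenizers: the lexicon, the 2-D "acoustic" centroids, and the function names `text_to_units` and `speech_to_units` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy pronunciation lexicon (assumption, not from the paper).
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
}

# Hypothetical 2-D centroids, one per phoneme unit, standing in for a
# trained acoustic model that labels each speech frame with a unit.
CENTROIDS: Dict[str, Tuple[float, float]] = {
    "HH": (0.0, 0.0),
    "AH": (1.0, 0.0),
    "L": (0.0, 1.0),
    "OW": (1.0, 1.0),
}


def text_to_units(text: str) -> List[str]:
    """Convert text to phoneme units by lexicon lookup."""
    units: List[str] = []
    for word in text.lower().split():
        units.extend(LEXICON[word])
    return units


def speech_to_units(frames: Sequence[Tuple[float, float]]) -> List[str]:
    """Label each frame with the unit whose centroid is nearest."""
    def sq_dist(a: Tuple[float, float], b: Tuple[float, float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(CENTROIDS, key=lambda u: sq_dist(f, CENTROIDS[u])) for f in frames]


# Toy "speech" frames and the matching text land in the same unit vocabulary.
frames = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (0.9, 1.1)]
speech_units = speech_to_units(frames)
text_units = text_to_units("hello")
print(speech_units, text_units)
```

Because both functions emit sequences over the same unit inventory, a single model could in principle consume either stream; how SpeechLM actually trains its tokenizers and Transformer is described in the paper and the released code.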