How to boost speech pre-training with textual data remains an open problem, because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained using a small amount of paired speech-text data. With the trained tokenizers, we convert unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative WER reduction over the best base model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM, with fewer parameters, even outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also evaluate SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
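The 16% figure quoted for LibriSpeech is a relative reduction, not an absolute one. A minimal sketch of the arithmetic, assuming the usual definition (baseline minus new value, divided by baseline); the function name `relative_reduction` is illustrative and does not come from the SpeechLM code base:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# WER values quoted in the abstract: 6.8 (best base model) and 5.7 (SpeechLM).
reduction = relative_reduction(6.8, 5.7)
print(f"{reduction:.1f}% relative WER reduction")
```

The absolute drop is 1.1 WER points; dividing by the 6.8 baseline gives roughly 16%, which matches the reported number.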