We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continually training it on text and speech units. Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. In both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, Speech Classification).
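The word-level interleaving idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `[TEXT]`/`[Hu…]` marker format, the switch probability, and the function name are all illustrative assumptions; it only shows how a word-aligned speech-text pair might be turned into a single mixed-modality token stream by switching modality at word boundaries.

```python
import random

def interleave(words, speech_units_per_word, p_switch=0.3, seed=0):
    """Sketch of word-level speech/text interleaving (hypothetical).

    words: list of text words.
    speech_units_per_word: parallel list of lists of speech unit ids,
        aligned to each word (e.g. from a forced aligner).
    Returns a single token stream mixing both modalities.
    """
    rng = random.Random(seed)
    tokens = []
    modality = rng.choice(["text", "speech"])  # start in a random modality
    for word, units in zip(words, speech_units_per_word):
        # Switch modality at a word boundary with probability p_switch.
        if rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        if modality == "text":
            tokens.append("[TEXT]" + word)          # text word as one token
        else:
            tokens.extend(f"[Hu{u}]" for u in units)  # speech units per word
    return tokens

# Example: three words, each aligned to a few speech unit ids.
stream = interleave(["the", "cat", "sat"], [[12, 7], [3], [44, 44, 9]])
print(stream)
```

Because modality switches only at word boundaries, each word is represented entirely in one modality, which matches the word-level granularity described in the abstract.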