We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continually training it on text and speech units. Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. In both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, Speech Classification).
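The word-level interleaving idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `[TEXT]`/`[Hu…]` marker format, the switch probability, and the function name are all illustrative assumptions; it only shows how a word-aligned speech-text pair might be turned into a single mixed-modality token stream by switching modality at word boundaries.

```python
import random

def interleave(words, speech_units_per_word, p_switch=0.3, seed=0):
    """Sketch of word-level speech/text interleaving (hypothetical).

    words: list of text words.
    speech_units_per_word: parallel list of lists of speech unit ids,
        aligned to each word (e.g. from a forced aligner).
    Returns a single token stream mixing both modalities.
    """
    rng = random.Random(seed)
    tokens = []
    modality = rng.choice(["text", "speech"])  # start in a random modality
    for word, units in zip(words, speech_units_per_word):
        # Switch modality at a word boundary with probability p_switch.
        if rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        if modality == "text":
            tokens.append("[TEXT]" + word)          # text word as one token
        else:
            tokens.extend(f"[Hu{u}]" for u in units)  # speech units per word
    return tokens

# Example: three words, each aligned to a few speech unit ids.
stream = interleave(["the", "cat", "sat"], [[12, 7], [3], [44, 44, 9]])
print(stream)
```

Because modality switches only at word boundaries, each word is represented entirely in one modality, which matches the word-level granularity described in the abstract.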