Speech and text are two major forms of human language. For many years, the research community has focused on mapping speech to text and vice versa. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers for transforming continuous speech signals into discrete units, and we use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess how well the model learns shared representations. Our results show that, by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.