Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text-to-speech, and speech-to-speech translation. Moreover, large language models (LLMs) pretrained on vast text corpora contain rich linguistic information that can improve accuracy on a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM and from a codebook derived from Whisper activations.