We present a novel approach to adapting pre-trained large language models (LLMs) to perform question answering (QA) and speech continuation. Endowing the LLM with a pre-trained speech encoder allows our model to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, which simplifies the architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text data, enabling a "cross-modal" chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM, as demonstrated on spoken QA datasets. Audio samples can be found at https://michelleramanovich.github.io/spectron/spectron
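To make the joint training objective more concrete, the following is a minimal sketch of how text and spectrogram supervision might be combined into one loss. It is an illustration under stated assumptions, not the implementation described here: the function names, the use of cross-entropy for the text terms, the mean absolute error for the spectrogram term, and the weights `w_text` and `w_speech` are all hypothetical choices made for this example.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy; `logits` holds one score list per step."""
    total = 0.0
    for scores, target in zip(logits, target_ids):
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


def spectrogram_l1(pred_frames, true_frames):
    """Mean absolute error over all bins of all spectrogram frames."""
    diffs = [
        abs(p - t)
        for pred, true in zip(pred_frames, true_frames)
        for p, t in zip(pred, true)
    ]
    return sum(diffs) / len(diffs)


def joint_loss(asr_logits, transcript_ids, cont_logits, continuation_ids,
               pred_frames, true_frames, w_text=1.0, w_speech=1.0):
    """Hypothetical combined objective: recognition + text continuation + synthesis."""
    # Text supervision: transcript of the prompt, then its text continuation.
    text_loss = (cross_entropy(asr_logits, transcript_ids)
                 + cross_entropy(cont_logits, continuation_ids))
    # Speech supervision: regression onto the target spectrogram frames.
    speech_loss = spectrogram_l1(pred_frames, true_frames)
    return w_text * text_loss + w_speech * speech_loss
```

In this sketch all three terms are computed from the same paired speech-text example, which mirrors the idea of supervising recognition, continuation, and synthesis together; the relative weighting of the terms is left as a free parameter.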