We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By equipping the LLM with a pre-trained speech encoder, the model accepts speech inputs and generates speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, which simplifies the architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text data, enabling a "cross-modal" chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model retains more of the original LLM's knowledge than direct initialization does, as demonstrated on spoken QA datasets. We release our audio samples (https://michelleramanovich.github.io/spectron/spectron) and spoken QA dataset (https://github.com/google-research-datasets/LLAMA1-Test-Set).
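The joint training objective described above can be illustrated with a minimal sketch. It assumes a token-level cross-entropy term covering the transcript and its text continuation, and an L1 reconstruction term on the predicted spectrogram frames, summed with a weight. The function names, the choice of L1, and the `recon_weight` parameter are hypothetical illustrations, not the exact formulation used for Spectron.

```python
import math

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy.

    logits: list of per-step score lists (one score per vocabulary entry).
    targets: list of target token indices, one per step.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)

def l1_spectrogram_loss(predicted, reference):
    """Mean absolute error over all (frame, frequency-bin) entries."""
    diffs = [
        abs(p - r)
        for pred_frame, ref_frame in zip(predicted, reference)
        for p, r in zip(pred_frame, ref_frame)
    ]
    return sum(diffs) / len(diffs)

def joint_loss(text_logits, text_targets, pred_frames, ref_frames, recon_weight=1.0):
    """Text loss (transcript + continuation tokens) plus weighted speech loss.

    recon_weight is a hypothetical balancing hyperparameter.
    """
    text_term = cross_entropy(text_logits, text_targets)
    speech_term = l1_spectrogram_loss(pred_frames, ref_frames)
    return text_term + recon_weight * speech_term
```

Because both terms are computed on the output of one decoding pass, the text tokens act as an intermediate step before the spectrogram frames are produced, which is the sense in which the chain-of-thought is cross-modal.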