The emergence of large language models (LLMs) has sparked significant interest in extending their language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions fall into two categories. The first is a cascaded approach, in which the outputs (tokens or states) of a separately trained speech recognition system are used as inputs to an LLM; this limits the LLM's ability to model the alignment between speech and text. The second is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose BLSP, an approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, so that the LLM exhibits the same generation behavior regardless of the input modality: a speech segment or its transcript. Training proceeds in two steps. In the first step, an LLM is prompted with speech transcripts as prefixes and generates text continuations. In the second step, these continuations are used as supervision signals to train the modality adapter end to end. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
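The two-step training process described in the abstract can be illustrated with a minimal sketch of the data flow. This is not the authors' implementation: the names `SpeechSample`, `AdapterExample`, `build_adapter_examples`, `toy_llm`, and `TRAINABLE`, as well as the file paths, are hypothetical, and `toy_llm` is a deterministic stand-in for a real LLM call. The abstract states that the speech encoder is frozen and that the adapter is learned; treating the LLM weights as fixed is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeechSample:
    audio_path: str   # hypothetical path to a speech segment
    transcript: str   # transcript of that segment


@dataclass
class AdapterExample:
    audio_path: str   # speech input, to be passed through encoder and adapter
    target_text: str  # continuation the LLM wrote from the transcript


def build_adapter_examples(
    samples: List[SpeechSample],
    generate: Callable[[str], str],
) -> List[AdapterExample]:
    """Step 1: use each transcript as a prefix and keep the LLM's continuation."""
    return [AdapterExample(s.audio_path, generate(s.transcript)) for s in samples]


def toy_llm(prefix: str) -> str:
    """Deterministic stand-in for a real LLM call (assumption, illustration only)."""
    return f"[continuation of: {prefix}]"


# Step 2 (not implemented here): train the adapter so that the LLM, given the
# adapted speech features, generates target_text. The flags below only record
# which components would receive gradient updates in this sketch; keeping the
# LLM fixed is an assumption, since the abstract only says the encoder is frozen.
TRAINABLE: Dict[str, bool] = {
    "speech_encoder": False,   # frozen, per the abstract
    "modality_adapter": True,  # the lightweight module being learned
    "llm": False,              # assumed fixed in this sketch
}

samples = [
    SpeechSample("clip_001.wav", "The weather today is"),
    SpeechSample("clip_002.wav", "My favorite book is about"),
]
examples = build_adapter_examples(samples, toy_llm)
for ex in examples:
    print(ex.audio_path, "->", ex.target_text)
```

The point of the pairing is that the supervision target comes from the LLM's own behavior on text, so no speech instruction data is needed: each training example links a speech segment to the continuation the LLM produced for the matching transcript.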