The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions fall into two categories. The first is a cascaded approach, in which the outputs (tokens or states) of a separately trained speech recognition system are used as inputs to an LLM; this limits the LLM's ability to model the alignment between speech and text. The second is an end-to-end approach that relies on speech instruction data, which is difficult to collect in large quantities. In this paper, we address these issues and propose BLSP, an approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the input modality: a speech segment or its transcript. Training proceeds in two steps. In the first step, an LLM is prompted with speech transcripts as prefixes to generate text continuations. In the second step, these continuations are used as supervision signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
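The two-step procedure can be illustrated with a minimal data-construction sketch. This is not the authors' implementation: the names `AdapterExample`, `build_adapter_examples`, and `toy_llm` are hypothetical, the speech segments are placeholder lists of floats, and the "LLM" is a toy function standing in for a real frozen model. The sketch covers only how transcripts become continuation targets paired with speech. The actual adapter training (step two) is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AdapterExample:
    """One training example for the modality adapter (hypothetical container)."""
    speech: Sequence[float]  # stand-in for audio samples or frozen-encoder features
    target_text: str         # continuation the LLM wrote from the transcript


def build_adapter_examples(
    speech_segments: List[Sequence[float]],
    transcripts: List[str],
    continue_text: Callable[[str], str],
) -> List[AdapterExample]:
    """Step 1: prompt the LLM with each transcript and keep its continuation.

    Step 2 (not shown) would train the adapter so that the LLM, given the
    speech input instead of the transcript, reproduces `target_text`.
    """
    if len(speech_segments) != len(transcripts):
        raise ValueError("each speech segment needs exactly one transcript")
    examples = []
    for speech, transcript in zip(speech_segments, transcripts):
        continuation = continue_text(transcript)
        examples.append(AdapterExample(speech=speech, target_text=continuation))
    return examples


def toy_llm(prefix: str) -> str:
    """Toy stand-in for a frozen LLM (assumption for illustration only)."""
    return " and then " + prefix.split()[-1] + " again."


examples = build_adapter_examples(
    speech_segments=[[0.0, 0.1, -0.2]],
    transcripts=["the cat sat down"],
    continue_text=toy_llm,
)
```

Note that the target is the continuation rather than the transcript itself: the adapter is supervised to elicit the same LLM behavior from speech as from text, not to perform transcription directly.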