BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions fall into two strategies. One is a cascaded approach, in which the outputs (tokens or states) of a separately trained speech recognition system are used as inputs to the LLM; this limits the LLM's potential to model the alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose BLSP, an approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the input modality: a speech segment or its transcript. Training proceeds in two steps. In the first step, an LLM is prompted to generate text with speech transcripts as prefixes, yielding text continuations. In the second step, these continuations are used as supervision signals to train the modality adapter end to end. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
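
The two-step process described in the abstract can be illustrated with a small data-construction sketch. This is not the authors' implementation. The names `build_examples`, `make_labels`, `toy_llm`, and the `IGNORE_INDEX` value of -100 are assumptions introduced for illustration, and the toy function merely stands in for a real LLM. The sketch shows only how continuations written from transcripts could become targets for the matching speech inputs, with the loss restricted to continuation tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed convention: positions carrying this label are excluded from the loss.
IGNORE_INDEX = -100


@dataclass
class Example:
    speech_id: str     # hypothetical reference to a speech segment
    continuation: str  # text the LLM wrote after reading the transcript


def build_examples(
    pairs: Sequence[Tuple[str, str]],
    continue_text: Callable[[str], str],
) -> List[Example]:
    """Step 1: use each transcript as a prefix and record the LLM's continuation."""
    return [Example(sid, continue_text(transcript)) for sid, transcript in pairs]


def make_labels(num_speech_positions: int, continuation_ids: Sequence[int]) -> List[int]:
    """Step 2 targets: supervise only the continuation tokens.

    Positions occupied by adapted speech features get IGNORE_INDEX, so only
    the continuation contributes to the training signal for the adapter.
    """
    return [IGNORE_INDEX] * num_speech_positions + list(continuation_ids)


def toy_llm(prefix: str) -> str:
    """Stand-in for a real LLM: repeats the last word and appends a fixed tail."""
    return prefix.split()[-1] + " and so on"


examples = build_examples([("utt1", "the cat sat")], toy_llm)
labels = make_labels(num_speech_positions=3, continuation_ids=[7, 8])
```

In an actual setup, the speech encoder and the LLM would stay frozen and only the adapter parameters would be updated against these targets, as the abstract describes.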

