BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach, in which the outputs (tokens or states) of a separately trained speech recognition system are used as inputs to the LLM; this limits the LLM's ability to model the alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach, which Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of the input: a speech segment or its transcript. The training process is divided into two steps. In the first step, an LLM is prompted to generate text with speech transcripts as prefixes, yielding text continuations. In the second step, these continuations are used as supervision signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
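The two-step process described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' implementation: the names `SpeechExample`, `AdapterTrainingExample`, `build_continuation_targets`, `make_labels`, and the `llm_continue` callable are all hypothetical, and the `-100` ignore label is a common training convention assumed here rather than something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumption: label value meaning "no loss at this position" (a common convention).
IGNORE_INDEX = -100


@dataclass
class SpeechExample:
    speech_id: str   # hypothetical handle for a speech segment
    transcript: str  # the text transcript of that segment


@dataclass
class AdapterTrainingExample:
    speech_id: str   # speech input, to be passed through the encoder and adapter
    target: str      # continuation the LLM wrote when given the transcript


def build_continuation_targets(
    examples: Sequence[SpeechExample],
    llm_continue: Callable[[str], str],
) -> List[AdapterTrainingExample]:
    """Step 1: use each transcript as a prefix and let the LLM continue it."""
    return [
        AdapterTrainingExample(ex.speech_id, llm_continue(ex.transcript))
        for ex in examples
    ]


def make_labels(num_speech_positions: int, target_token_ids: Sequence[int]) -> List[int]:
    """Step 2 helper: supervise only the continuation tokens.

    Positions occupied by adapted speech features carry no loss, so the
    adapter is trained purely to make the LLM reproduce the continuation.
    """
    return [IGNORE_INDEX] * num_speech_positions + list(target_token_ids)
```

In an actual training loop, the speech encoder and the LLM would stay frozen and only the adapter's parameters would receive gradients from the loss over these labels; that part is omitted because the abstract does not specify the models or the loss.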

