Incorporating speech understanding capabilities into pretrained large language models (SpeechLLM) has become a vital research direction. Previous architectures fall into two categories: i) GPT-style, which prepends speech prompts to the text prompts as a single sequence of LLM inputs, as in a decoder-only model; and ii) T5-style, which introduces speech cross-attention into each layer of the pretrained LLM. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, neither style has a clear streaming solution, especially one that generalizes to multiple speech tasks. We reformulate streamable SpeechLLM as a read-write policy problem and unify offline and streaming research under the BESTOW architecture. We thereby demonstrate the first open-source SpeechLLM solution that supports streaming and multitask operation at scale (beyond ASR) at the same time. This streamable solution achieves strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, has lower training and inference cost, and demonstrates that LLM knowledge transfers to speech.
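As a rough illustration of the two prior styles contrasted above, the following minimal sketch compares how each one exposes speech features to the LLM. It is not the BESTOW implementation: the function names `gpt_style_inputs` and `cross_attend` are hypothetical, the features are random placeholders, and the learned query/key/value projections of a real cross-attention layer are omitted for brevity.

```python
import math
import random


def gpt_style_inputs(speech_feats, text_embeds):
    """GPT-style: prepend speech frames to the text embeddings.

    The LLM then self-attends over one sequence whose length is
    len(speech_feats) + len(text_embeds).
    """
    return speech_feats + text_embeds  # list concatenation along the time axis


def cross_attend(text_embeds, speech_feats):
    """Cross-attention style: each text position queries the speech frames.

    Simplified single-head attention without learned projections
    (an assumption of this sketch). The output keeps the text length.
    """
    dim = len(text_embeds[0])
    outputs = []
    for query in text_embeds:
        # Scaled dot-product score between this text position and every frame.
        scores = [
            sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
            for frame in speech_feats
        ]
        # Numerically stable softmax over the speech frames.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of speech frames, added back as a residual.
        context = [
            sum(w * frame[j] for w, frame in zip(weights, speech_feats))
            for j in range(dim)
        ]
        outputs.append([q + c for q, c in zip(query, context)])
    return outputs


random.seed(0)
speech = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]  # 50 frames
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]     # 6 tokens

print(len(gpt_style_inputs(speech, text)))  # 56: sequence grows with the audio
print(len(cross_attend(text, speech)))      # 6: sequence stays at text length
```

In this toy setting the prepending style lengthens the LLM input by the number of speech frames, while the cross-attention style leaves the input length unchanged and injects speech through an extra attention step, which is the kind of efficiency trade-off the two categories imply.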