The impressive capability and versatility of large language models (LLMs) have attracted increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder to an LLM. This paper presents a comparative study of three commonly used connector structures: fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series and LLMs from the Vicuna model series, with different model sizes, were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs generalise well to out-of-domain datasets: a 12% relative WER reduction over the Whisper baseline ASR model was achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments whose duration exceeds the limit of the encoders, which results in a 17% relative WER reduction over other connector structures on 90-second-long speech data.
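For readers less familiar with the metric, the sketch below shows how a WER and a relative WER reduction of the kind quoted above are typically computed. It is a minimal illustration, not the evaluation code used in this work: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, text normalisation is reduced to whitespace splitting, and the numbers in the usage lines are made up for illustration rather than taken from the experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.12 means 12% relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (assumed, not results from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
example_reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=8.8)
```

Here `example_wer` corresponds to one deleted word out of six reference words, and `example_reduction` is the relative improvement of a hypothetical system at 8.8% WER over a hypothetical baseline at 10.0% WER. A relative reduction is measured against the baseline WER, so it should not be read as an absolute difference in percentage points.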