The impressive capability and versatility of large language models (LLMs) have attracted growing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder to an LLM. This paper presents a comparative study of three commonly used connector structures: fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series and LLMs from the Vicuna model series, each at several model sizes, were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where LLMs with Q-Formers showed consistent and considerable word error rate (WER) reductions over LLMs with the other connector structures. Q-Former-based LLMs also generalise well to out-of-domain data: a 12% relative WER reduction over the Whisper baseline ASR model was achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments whose duration exceeds the limit of the encoder, which yields a 17% relative WER reduction over the other connector structures on 90-second-long speech data.
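The results above are reported as relative WER reductions, so a minimal sketch of how such a figure is typically computed may be useful. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the example values (a baseline WER of 10.0% and a new WER of 8.8%) are illustrative assumptions, not numbers taken from the paper. The paper's own scoring pipeline, including any text normalisation, is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenisation with no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Illustrative values only: 10.0% -> 8.8% is a 12% relative reduction.
    print(relative_wer_reduction(10.0, 8.8))
```

Note that a relative reduction is measured against the baseline's own error rate, so a 12% relative reduction on a 10.0% baseline is 1.2 absolute points, not 12.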