In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, and train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively, improving by 7.5% WER overall and 17% WER on rare words over a baseline contextualized RNN-T system trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
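To make the mixed-modal formulation concrete, the following is a minimal sketch of how one training sequence might be assembled. It is an illustration under stated assumptions, not the implementation used in this work: the function name `build_example`, the position tags, the ordering (optional context tokens, then audio frames, then transcript tokens), and the choice to mark only transcript positions as loss targets are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical position tags; the abstract does not specify a sequence layout.
CONTEXT, AUDIO, TEXT = "context", "audio", "text"


def build_example(
    audio_frames: List[List[float]],
    transcript_ids: List[int],
    context_ids: Optional[List[int]] = None,
) -> Tuple[List[Tuple[str, object]], List[bool]]:
    """Assemble one mixed-modal sequence and its loss mask.

    Assumed order: [optional context tokens] + [audio frames] + [transcript tokens].
    The mask is True only at transcript positions, i.e. the tokens the
    decoder-only model is asked to predict.
    """
    sequence: List[Tuple[str, object]] = []
    # Optional textual context comes first (an assumption for illustration).
    for token_id in context_ids or []:
        sequence.append((CONTEXT, token_id))
    # Audio feature frames stand in for embeddings fed to the LLM.
    for frame in audio_frames:
        sequence.append((AUDIO, frame))
    # Transcript tokens are the completion targets.
    for token_id in transcript_ids:
        sequence.append((TEXT, token_id))
    loss_mask = [kind == TEXT for kind, _ in sequence]
    return sequence, loss_mask


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    seq, mask = build_example(frames, transcript_ids=[7, 8], context_ids=[42])
    print([kind for kind, _ in seq])
    print(mask)
```

Because the context is optional, the same routine covers both the contextualized and the context-free case; in the latter, the sequence simply starts with the audio frames.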