We propose using an instruction-tuned large language model (LLM) to guide text generation in automatic speech recognition (ASR). Modern LLMs perform a wide range of text generation tasks zero-shot when prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to further refine the output. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis to be corrected is obtained from the encoder via CTC decoding and fed into the LLM together with a specific instruction. The decoder then takes the LLM output as input when predicting tokens, combining acoustic information from the encoder with the powerful linguistic information provided by the LLM. Experimental results show that the proposed LLM-guided model achieves a relative word error rate reduction of approximately 13\% across major benchmarks.
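The decoding flow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: all function and parameter names (`ctc_greedy_decode`, `llm_guided_decode`, the `encoder`/`llm`/`decoder` callables) are assumptions introduced for clarity.

```python
# Hypothetical sketch of the LLM-guided pipeline; names are illustrative,
# not taken from the paper's code.

def ctc_greedy_decode(frame_logits, blank=0):
    """Standard greedy CTC: frame-wise argmax, collapse repeats, drop blanks."""
    ids = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    hyp, prev = [], None
    for i in ids:
        if i != blank and i != prev:
            hyp.append(i)
        prev = i
    return hyp

def llm_guided_decode(encoder, llm, decoder, speech, instruction):
    """One pass of the described model (sketch).

    The encoder yields acoustic features plus CTC logits; greedy CTC decoding
    produces the first-pass hypothesis; the LLM, prompted with the correction
    instruction and that hypothesis, acts as a front-end feature extractor;
    the decoder fuses the acoustic and LLM-derived representations.
    """
    acoustic, ctc_logits = encoder(speech)       # acoustic features + CTC head
    hypothesis = ctc_greedy_decode(ctc_logits)   # first-pass ASR hypothesis
    llm_features = llm(instruction, hypothesis)  # LLM-derived representations
    return decoder(acoustic, llm_features)       # token predictions
```

The key design point is that the LLM is not asked to emit a corrected transcript directly; its internal representations of the (instruction, hypothesis) pair condition the attention decoder alongside the encoder's acoustic features.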