Integrating language models (LMs) has proven effective for addressing domain shift in speech recognition. However, such approaches usually require a significant amount of target-domain text to train the LM. In contrast, this work proposes two zero-shot ASR domain adaptation methods that need only a domain-specific text prompt and use LLaMA, a 7-billion-parameter large language model (LLM). The LLM is used in two ways: 1) second-pass rescoring, which reranks the N-best hypotheses of a given ASR system with LLaMA; and 2) deep LLM-fusion, which incorporates the LLM into the decoder of an encoder-decoder based ASR system. Experiments show that, with only one domain prompt, both methods effectively reduce word error rate (WER) on the out-of-domain TedLium-2 and SPGISpeech datasets. In particular, deep LLM-fusion offers better recall of entities and out-of-vocabulary words.
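As an illustration of the second-pass rescoring idea, the following is a minimal sketch and not the authors' implementation. It assumes each N-best hypothesis carries an ASR log-score, that an LM returns a log-probability for a hypothesis conditioned on the domain prompt, and that the two are combined by a simple weighted sum. The names `rescore_nbest`, `toy_lm_logprob`, and `lm_weight` are hypothetical, and the toy scorer only stands in for a real LLM such as LLaMA.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str, str], float],
    domain_prompt: str,
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rerank (text, asr_log_score) pairs using a prompt-conditioned LM score.

    The weighted-sum combination and the default weight are assumptions
    made for this sketch; the abstract does not specify the formula.
    """
    rescored = []
    for text, asr_score in nbest:
        combined = asr_score + lm_weight * lm_logprob(domain_prompt, text)
        rescored.append((text, combined))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_lm_logprob(prompt: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an LLM: words seen in the prompt cost less."""
    prompt_words = set(prompt.lower().split())
    return sum(
        -1.0 if word in prompt_words else -3.0
        for word in hypothesis.lower().split()
    )


if __name__ == "__main__":
    prompt = "quarterly earnings call revenue guidance"
    nbest = [
        ("the revenue guy dance was raised", -4.0),  # ASR's first choice
        ("the revenue guidance was raised", -4.2),
    ]
    best_text, _ = rescore_nbest(nbest, toy_lm_logprob, prompt)[0]
    print(best_text)
```

In practice, `lm_logprob` would sum the token log-probabilities that the LLM assigns to the hypothesis when the domain prompt is prepended as context, and `lm_weight` would be tuned on held-out data.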