We introduce a new cross-modal fusion technique for generative error correction in automatic speech recognition (ASR). Our method leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts, marking a step towards a new paradigm of generative error correction over n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance obtained from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique and demonstrate a relative word error rate (WERR) improvement of 37.66% over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
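For readers unfamiliar with the metric, the following minimal sketch shows how a relative word error rate (WERR) figure like the one quoted in the abstract is commonly computed. It assumes the conventional definition, WERR = (WER_baseline - WER_system) / WER_baseline, with WER taken as the word-level edit distance divided by the reference length. The function names are hypothetical and are not taken from the released code, which may define or compute the metric differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative WER improvement in percent (assumed conventional definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    baseline = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    corrected = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
    print(f"baseline WER: {baseline:.3f}, corrected WER: {corrected:.3f}")
    print(f"WERR: {relative_wer_improvement(baseline, corrected):.2f}%")
```

Under this definition, a correction model that lowers WER from that of the n-best baseline by about three eighths would report a WERR close to the 37.66% stated above.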