We introduce a new cross-modal fusion technique for generative error correction in automatic speech recognition (ASR). The method uses both acoustic information and external linguistic representations to generate accurate speech transcription contexts, a step towards a new paradigm of generative error correction over n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance obtained from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we assess the stability and reproducibility of the fusion technique and demonstrate a relative word error rate improvement (WERR) of 37.66% over the n-best hypotheses. To encourage future research, we have open-sourced our code and pre-trained models at https://github.com/Srijith-rkr/Whispering-LLaMA.
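The abstract reports its headline result as WERR without spelling out the formula. As a minimal sketch, assuming the conventional definitions (WER as word-level edit distance divided by the number of reference words, and WERR as the percentage reduction in WER relative to a baseline), the metric could be computed as below. The function names and the toy sentences are hypothetical illustrations, not taken from the released code or from the paper's data.

```python
from typing import Sequence


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def werr(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent (assumed definition of WERR)."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Toy data, invented for illustration only.
references = ["the cat sat on the mat", "speech recognition is hard"]
baseline = ["the cat sat on a mat", "speech recognition was hard"]
corrected = ["the cat sat on the mat", "speech recognition was hard"]

baseline_wer = wer(references, baseline)
corrected_wer = wer(references, corrected)
print(f"baseline WER={baseline_wer:.3f}, corrected WER={corrected_wer:.3f}, "
      f"WERR={werr(baseline_wer, corrected_wer):.2f}%")
```

Under this definition, a WERR of 37.66% would mean the corrected transcripts have a WER that is 37.66% lower than the baseline's, not 37.66 absolute percentage points lower.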