Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade under adverse conditions. Generative error correction (GER) leverages the strong text comprehension capabilities of large language models (LLMs) and delivers impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER faces challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and multi-task learning for simultaneous ASR and accent recognition (AR) has proven an effective and prominent solution for multi-accent scenarios. In this work, we propose a unified ASR-AR GER model, named MMGER, that leverages multi-modal correction and multi-granularity correction. Multi-task ASR-AR learning provides dynamic 1-best hypotheses and accent embeddings. Multi-modal correction performs fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements global linguistic information by incorporating regular 1-best hypotheses atop the fine-grained multi-modal correction, achieving coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER: compared with a well-established standard baseline, it achieves a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate.
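The fine-grained multi-modal correction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all dimensions are made up, the forced-alignment durations are given rather than computed, and fusion is shown as simple concatenation of per-frame acoustic features, duration-repeated character embeddings of the 1-best hypothesis, and a broadcast accent embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only (not from the paper).
num_frames, acoustic_dim = 12, 8
char_dim, accent_dim = 8, 4

# Acoustic features of one utterance: (frames, acoustic_dim).
acoustic = rng.standard_normal((num_frames, acoustic_dim))

# Character-level 1-best hypothesis: 3 characters, with forced-alignment
# durations (frames per character) that sum to num_frames.
char_embeds = rng.standard_normal((3, char_dim))
durations = [5, 3, 4]

# Frame-level alignment: repeat each character embedding over its frames.
aligned_chars = np.repeat(char_embeds, durations, axis=0)

# Utterance-level accent embedding, broadcast to every frame.
accent = rng.standard_normal(accent_dim)
accent_frames = np.tile(accent, (num_frames, 1))

# Fused frame-level multi-modal features: one vector per acoustic frame,
# carrying acoustic, linguistic (hypothesis), and accent information.
fused = np.concatenate([acoustic, aligned_chars, accent_frames], axis=1)
print(fused.shape)  # (12, 20)
```

In an actual system the fused sequence would then be fed, together with the regular utterance-level 1-best hypothesis, into the LLM for correction; here only the alignment-and-fusion step is shown.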