Generative Error Correction (GEC) has emerged as a powerful post-processing method for enhancing the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon is amplified for named entities (NEs): beyond insufficient contextual information or knowledge about the NEs, novel NEs continually emerge. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in both in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains in a similar, unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction, augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8\%--30\% relative WER improvements in ID settings and 10\%--33\% improvements in OOD settings.
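To make the retrieval-augmented correction idea concrete, the sketch below illustrates the general pattern the abstract describes: retrieve database entities similar to words in the ASR hypothesis, then expose them in the correction prompt given to the LLM. This is a minimal illustration, not the paper's implementation; the entity database, the similarity measure (`difflib` string similarity here), and the prompt wording are all hypothetical stand-ins.

```python
import difflib

# Hypothetical entity database; the abstract does not specify the paper's
# actual retrieval setup, so this list and the matching below are illustrative.
ENTITY_DB = ["Kyoto", "Quito", "Kigali", "Kathmandu"]

def retrieve_entities(hypothesis, db, k=2, cutoff=0.5):
    """Return up to k database entities whose spelling is close to some
    word in the ASR hypothesis (a stand-in for a real retriever)."""
    best = {}
    for word in hypothesis.split():
        for ent in db:
            score = difflib.SequenceMatcher(None, word.lower(), ent.lower()).ratio()
            if score >= cutoff and score > best.get(ent, 0.0):
                best[ent] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]

def build_gec_prompt(hypotheses, entities):
    """Assemble a correction prompt that exposes retrieved entities to the LLM."""
    lines = ["Correct the ASR transcript below."]
    if entities:
        lines.append("Relevant named entities: " + ", ".join(entities))
    lines += [f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses)]
    return "\n".join(lines)

# ASR often garbles rare entity names; retrieval surfaces the likely intended one.
hyps = ["the mayor of keeto announced", "the mayor of keeto announce"]
ents = retrieve_entities(hyps[0], ENTITY_DB)   # ["Kyoto"]
prompt = build_gec_prompt(hyps, ents)
```

The retrieved entities give the corrector knowledge it cannot recover from the acoustics-derived hypotheses alone, which is precisely the gap the abstract identifies for novel or rare NEs.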