Automatic speech recognition (ASR) remains unsatisfactory in scenarios where the speaking style diverges from that of the data used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue, the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data using only the 1-best hypothesis transcript. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose incorporating discrete speech units to align with and enhance the word embeddings, thereby improving AEC quality. Results on multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.