Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the development of self-supervised learning. However, E2E ASR models trained on paired speech-text data often suffer from domain shift between training and testing. To alleviate this issue, this paper proposes a flat-start joint training method, named FastInject, which efficiently injects multi-domain unpaired text data into CTC-based ASR training. To maintain training efficiency, text units are pre-upsampled, and their representations are fed into the CTC model along with speech features. To bridge the modality gap between speech and text, an attention-based modality matching mechanism (AM3) is proposed, which retains the E2E flat-start training. Experiments show that the proposed FastInject achieves a 22% relative word error rate (WER) reduction (WERR) on intra-domain LibriSpeech-100h data and a 20% relative WERR on out-of-domain test sets.
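For readers unfamiliar with the metric, relative WER reduction is conventionally computed as the drop in WER divided by the baseline WER. The following is a minimal sketch of that arithmetic; the function name and the example WER values are hypothetical and chosen only for illustration, not taken from the paper's results.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) used only to illustrate the arithmetic;
# these are NOT numbers reported in the paper.
werr = relative_wer_reduction(baseline_wer=10.0, new_wer=7.8)
print(f"WERR = {werr:.1%}")
```

Under this definition, a relative reduction is distinct from an absolute one: in the hypothetical case above the absolute WER drop is 2.2 points, while the relative reduction is measured against the baseline.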