Code-switching speech mixes two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling is challenging for such speech because training data are scarce. In this study, we investigate text generation and injection to improve a streaming model widely used in industry, the Transformer-Transducer (T-T), on Mandarin-English code-switching speech recognition. We first propose a strategy for generating code-switching text data. We then investigate two ways of injecting the generated text into the T-T model: explicitly, through Text-To-Speech (TTS) conversion, or implicitly, by tying the speech and text latent spaces. Experiments on a T-T model trained on 1,800 hours of real Mandarin-English code-switched speech show that injecting generated code-switching text significantly improves performance, yielding a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets. Tying the speech and text latent spaces outperforms TTS conversion on the evaluation set whose data are more homogeneous with the training set.
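For readers unfamiliar with the metric, the following is a minimal sketch of how a token-based error rate and a relative reduction could be computed for mixed Mandarin-English text. The abstract does not define the tokenization, so this sketch assumes one token per Chinese character and one token per English word, which is a common convention and not necessarily the authors' exact setup. The function names (`tokenize`, `token_error_rate`, `relative_reduction`) are hypothetical.

```python
import re

# Assumption: a token is a single CJK character or a run of Latin
# letters, digits, or apostrophes. The paper may tokenize differently.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize(text):
    """Split mixed Mandarin-English text into characters and lowercase words."""
    return TOKEN_RE.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_error_rate(refs, hyps):
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


def relative_reduction(baseline, improved):
    """Relative error-rate reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline
```

As a usage illustration with made-up numbers (not results from the paper), a baseline TER of 10.0% and an improved TER of 8.4% give `relative_reduction(0.100, 0.084)`, which is approximately 0.16, i.e. a 16% relative reduction.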