Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and now perform well on general speech recognition. However, E2E models still struggle in several challenging scenarios, such as code-switching and named entity recognition (NER). Data augmentation is a common and effective remedy for both. Current data augmentation methods, however, rely mainly on audio splicing and text-to-speech (TTS) models, which can produce discontinuous, unrealistic, and insufficiently diverse speech. To mitigate these issues, we propose a novel data augmentation method based on a text-based speech editing model. The augmented speech produced by speech editing systems is more coherent, more diverse, and closer to real speech. Experimental results on code-switching and NER tasks show that the proposed method significantly outperforms data augmentation systems based on audio splicing and neural TTS.