We propose phoneme audio mix up (PAMP), a novel text-to-speech (TTS) data augmentation framework for low-resource automatic speech recognition (ASR). PAMP is highly interpretable and can incorporate prior knowledge of pronunciation rules. It can also be deployed easily in almost any language, which makes it especially suitable for low-resource ASR tasks. Extensive experiments demonstrate its effectiveness: we achieve a \textbf{10.84\%} character error rate (CER) on the Common Voice Cantonese ASR task, a relative improvement of about \textbf{30\%} over the previous state of the art, which was obtained by fine-tuning the pretrained wav2vec2 model.