The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline that generates mispronunciation errors to overcome this data scarcity. SpeechBlender uses a variety of masks to target different regions of a phonetic unit, and uses mixing factors to linearly interpolate the raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. On Speechocean762, our proposed technique achieves state-of-the-art results for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) over the previous state of the art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observe a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
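The masked linear interpolation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `region_mask` and `blend`, the binary mask shape, the assumption that both segments have already been cut to the same length, and the convention that the mixing factor `lam` weights the original segment are all illustrative assumptions.

```python
import numpy as np


def region_mask(n, start, end):
    """Hypothetical helper: binary mask that is 1 on samples [start, end), 0 elsewhere."""
    mask = np.zeros(n, dtype=np.float64)
    mask[start:end] = 1.0
    return mask


def blend(candidate, donor, mask, lam):
    """Blend two equal-length raw speech segments (illustrative sketch).

    Inside the masked region the output is lam * candidate + (1 - lam) * donor;
    outside the mask the candidate segment is left unchanged.
    """
    candidate = np.asarray(candidate, dtype=np.float64)
    donor = np.asarray(donor, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if not (candidate.shape == donor.shape == mask.shape):
        raise ValueError("candidate, donor and mask must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = lam * candidate + (1.0 - lam) * donor
    return mask * mixed + (1.0 - mask) * candidate


# Toy signals standing in for two phoneme segments of equal length.
candidate = np.ones(8)   # segment of the phoneme being augmented
donor = np.zeros(8)      # segment of a different phoneme
mask = region_mask(8, 2, 6)
blended = blend(candidate, donor, mask, lam=0.25)
```

In this sketch, a mask covering the whole segment with `lam=0` replaces the segment outright, which corresponds to a plain cut-and-paste substitution; intermediate values of `lam` and partial masks give softer mixtures.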