Mispronunciation detection and diagnosis (MDD) is an important component of modern computer-aided language learning (CALL) systems. However, most systems that implement phoneme-level MDD through goodness of pronunciation (GOP) rely on pre-segmentation of speech into phonetic units. This limits both the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method (GOP-SF) that takes all possible segmentations of the canonical transcription into account. We give a theoretical account of our definition of GOP-SF, an implementation that addresses potential numerical issues, and a proper normalization that allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the speechocean762 dataset, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
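To make the idea of taking all possible segmentations of the canonical transcription into account more concrete, the following is a minimal sketch of the standard CTC forward recursion in log space, which sums over every frame-level alignment of a phoneme sequence. It is not the paper's GOP-SF definition: the function names `ctc_log_likelihood` and `substitution_log_ratio`, the choice of blank index 0, and the substitution-based score are all assumptions made for illustration, and the normalization for peakiness described in the abstract is not modeled here.

```python
import numpy as np


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability of `target` under CTC, summed over all alignments.

    log_probs: array of shape (T, V) with per-frame log posteriors.
    target:    list of label indices without blanks.
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    num_frames = log_probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # Work in log space so long utterances do not underflow.
    alpha = np.full(size, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if size > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, num_frames):
        prev = alpha
        alpha = np.full(size, -np.inf)
        for s in range(size):
            candidates = [prev[s]]
            if s >= 1:
                candidates.append(prev[s - 1])
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # Valid paths end in the last label or the trailing blank.
    if size > 1:
        return np.logaddexp.reduce(alpha[-2:])
    return alpha[-1]


def substitution_log_ratio(log_probs, canonical, index, vocab_size, blank=0):
    """Hypothetical segmentation-free score for the phoneme at `index`.

    Compares the canonical sequence against all single-phoneme
    substitutions at that position (the canonical phoneme included),
    so the result is always <= 0. Illustrative only.
    """
    canonical_ll = ctc_log_likelihood(log_probs, canonical, blank)
    alternatives = []
    for phone in range(vocab_size):
        if phone == blank:
            continue
        candidate = list(canonical)
        candidate[index] = phone
        alternatives.append(ctc_log_likelihood(log_probs, candidate, blank))
    return canonical_ll - np.logaddexp.reduce(alternatives)
```

In this sketch no phoneme boundaries are ever fixed: the score for a target phoneme depends on the whole utterance, which is one way to read the abstract's remark about the amount of context around the target phoneme.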