Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

Mispronunciation Detection and Diagnosis (MDD), the automatic identification and analysis of pronunciation errors, plays a crucial role in Computer-Aided Pronunciation Learning (CAPL) tools such as second-language (L2) learning and speech therapy applications. Existing MDD methods that analyse phonemes can detect only categorical phoneme errors for which enough training data is available to model them. Because the pronunciation errors of non-native or disordered speakers are unpredictable and training datasets are scarce, it is infeasible to model all types of mispronunciation. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model of the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. Compared with the traditional phoneme-level MDD, the proposed speech attribute MDD method achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes.
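To illustrate why attribute-level output can give more formative feedback than a phoneme label alone, the following minimal Python sketch compares the attributes of an expected phoneme with those of the produced one. The attribute table, the phoneme symbols, and the `diagnose` helper are hypothetical and hand-picked for illustration; they are not the attribute inventory or the detector described in the paper, which predicts attributes directly from audio.

```python
# Hypothetical, hand-picked attribute sets for a few English consonants.
# This is an illustrative assumption, not the paper's attribute inventory.
PHONEME_ATTRIBUTES = {
    "s": {"consonant", "fricative", "alveolar"},
    "z": {"consonant", "fricative", "alveolar", "voiced"},
    "t": {"consonant", "stop", "alveolar"},
    "d": {"consonant", "stop", "alveolar", "voiced"},
    "th": {"consonant", "fricative", "dental"},
}


def diagnose(expected: str, produced: str) -> dict:
    """Return the attributes that differ between expected and produced phonemes.

    "missing" lists attributes the target phoneme has but the produced one lacks;
    "unexpected" lists attributes present only in the produced phoneme.
    """
    expected_attrs = PHONEME_ATTRIBUTES[expected]
    produced_attrs = PHONEME_ATTRIBUTES[produced]
    return {
        "missing": sorted(expected_attrs - produced_attrs),
        "unexpected": sorted(produced_attrs - expected_attrs),
    }


if __name__ == "__main__":
    # A learner devoices /z/ and produces /s/: only voicing is wrong.
    print(diagnose("z", "s"))
    # A learner replaces the dental fricative with /t/: place and manner differ.
    print(diagnose("th", "t"))
```

Under these assumptions, a substitution is reported as the specific articulatory properties that went wrong (for example, missing voicing) rather than as a single wrong-phoneme label.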