The disparity in phonology between a learner's native (L1) and target (L2) languages poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple 'views' of the same input data, assisted by auxiliary tasks, to learn more distinctive phonetic representations in a low-resource setting. Using mono- and multilingual encoders, the model learns multiple views of the input and captures sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. On the L2-ARCTIC data, our model outperforms the state-of-the-art (SOTA) models, reducing phoneme error rate by 11.13% and 8.60% and increasing F1 score by an absolute 5.89% and 2.49% compared to the single-view mono- and multilingual systems, respectively, with a limited L2 dataset.
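For readers unfamiliar with the phoneme error rate (PER) reported above, the following is a minimal sketch of how PER is commonly computed: the Levenshtein (edit) distance between the reference and recognized phoneme sequences, divided by the reference length. The function names and the example phoneme strings are illustrative assumptions; the abstract does not specify the exact scoring protocol used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of labels)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref phoneme missing from hyp
                curr[j - 1] + 1,     # insertion: extra phoneme in hyp
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution (DH -> D) and one deletion (T).
reference = ["DH", "AH", "K", "AE", "T"]
recognized = ["D", "AH", "K", "AE"]
per = phoneme_error_rate(reference, recognized)
```

Here the two errors over five reference phonemes give a PER of 0.4. Because insertions are counted, PER can exceed 1.0 when the recognized sequence is much longer than the reference.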