Automatic pronunciation assessment (APA) aims to quantify the pronunciation proficiency of a second-language (L2) learner in a target language. Prevailing approaches to APA typically use neural models trained with a regression loss function, such as the mean-squared error (MSE) loss, for proficiency level prediction. Although most regression models can effectively capture the ordinality of proficiency levels in the feature space, they face a key obstacle: representations of different phoneme categories that share the same proficiency level are inevitably forced close to each other, so less phoneme-discriminative information is retained. To address this, we devise a phonemic contrast ordinal (PCO) loss for training regression-based APA models, which aims to better preserve the distinctions between phoneme categories while still accounting for the ordinal relationships of the regression target. Specifically, we introduce a phoneme-distinct regularizer into the MSE loss, which encourages the feature representations of different phoneme categories to be far apart while simultaneously pulling the representations belonging to the same phoneme category closer together by means of weighted distances. An extensive set of experiments carried out on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our model compared with several existing state-of-the-art models.
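To make the shape of such a loss concrete, the following is a minimal sketch of an MSE loss augmented with a phoneme-distinct regularizer. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `pco_style_loss`, the use of per-phoneme centroids, the Euclidean distance, the coefficients `lam_d` and `lam_t`, and the optional per-sample `weights` are all hypothetical, since the abstract does not specify how the distances are computed or weighted.

```python
from collections import defaultdict
from math import sqrt


def _dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _centroids(features, phonemes):
    """Mean feature vector per phoneme category."""
    groups = defaultdict(list)
    for feat, ph in zip(features, phonemes):
        groups[ph].append(feat)
    return {ph: [sum(col) / len(fs) for col in zip(*fs)]
            for ph, fs in groups.items()}


def pco_style_loss(preds, targets, features, phonemes,
                   weights=None, lam_d=1.0, lam_t=1.0):
    """MSE plus a phoneme-distinct regularizer (illustrative sketch only).

    Assumptions, not taken from the source: centroids summarize each
    phoneme category, `weights` are caller-supplied per-sample weights
    (default 1.0), and lam_d / lam_t balance the two regularizer terms.
    """
    n = len(preds)
    if weights is None:
        weights = [1.0] * n

    # Regression term: preserves the ordinal structure of the scores.
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n

    cents = _centroids(features, phonemes)
    labels = sorted(cents)

    # Push term: mean distance between centroids of different phonemes.
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    spread = (sum(_dist(cents[a], cents[b]) for a, b in pairs) / len(pairs)
              if pairs else 0.0)

    # Pull term: weighted distance of each feature to its own centroid.
    tight = sum(w * _dist(f, cents[ph])
                for w, f, ph in zip(weights, features, phonemes)) / n

    # A larger spread lowers the loss; a larger within-phoneme distance raises it.
    return mse - lam_d * spread + lam_t * tight
```

In this sketch, minimizing the loss fits the proficiency scores through the MSE term, while the subtracted spread term rewards separation between phoneme categories and the added tightness term penalizes scatter within a category. In practice the same structure would be written with a differentiable tensor library so that gradients flow back into the feature extractor.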