Automated pronunciation assessment increasingly emphasizes evaluating multiple aspects of speech to provide richer feedback. However, collecting speech from non-native language learners labeled with multi-aspect scores is challenging, and the resulting data often have imbalanced score distributions. In this paper, we propose two Acoustic Feature Mixup strategies, which interpolate each feature linearly or non-linearly with the in-batch averaged feature, to address data scarcity and score-label imbalance. Using goodness-of-pronunciation (GOP) as the primary acoustic feature, we tailor the mixup designs to pronunciation assessment. We further integrate fine-grained error-rate features, obtained by comparing speech recognition results with the original answer phonemes, which give direct hints of mispronunciation. Effective mixing of the acoustic features notably improves overall scoring performance on the speechocean762 dataset, and detailed analysis highlights the method's potential to predict unseen distortions.
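As an illustration of the linear strategy, the following is a minimal sketch of interpolating each utterance's acoustic feature with the in-batch averaged feature. It is not the paper's implementation: the function name `linear_mixup_with_batch_mean`, the Beta-distributed mixing weight, the `alpha` parameter, and the decision to mix the score labels with the same weight are all assumptions borrowed from standard mixup. The non-linear variant is not sketched because the abstract does not specify its form.

```python
import numpy as np


def linear_mixup_with_batch_mean(features, scores, alpha=0.5, rng=None):
    """Mix each sample with the batch-averaged feature (hypothetical sketch).

    features: array of shape (batch, dim), e.g. GOP-based features.
    scores:   array of shape (batch, num_aspects), the score labels.
    alpha:    Beta(alpha, alpha) parameter for the mixing weight (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # One mixing weight per batch, drawn as in standard mixup (assumption).
    lam = rng.beta(alpha, alpha)

    # In-batch averages; keepdims lets them broadcast over the batch axis.
    mean_feature = features.mean(axis=0, keepdims=True)
    mean_score = scores.mean(axis=0, keepdims=True)

    # Linear interpolation toward the batch average.
    mixed_features = lam * features + (1.0 - lam) * mean_feature
    # Mixing labels with the same weight is an assumption, not from the source.
    mixed_scores = lam * scores + (1.0 - lam) * mean_score
    return mixed_features, mixed_scores, lam
```

Because every sample is pulled toward the same batch average, the batch mean of the mixed features equals the batch mean of the originals, while the spread around that mean shrinks by the factor `lam`.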