Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner's speech. Recently, self-supervised learning (SSL) has shown impressive performance compared to traditional methods. However, SSL-based ASA systems face at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels, and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore two novel modeling strategies, metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. Extensive experiments on the ICNALE benchmark dataset show that our approach outperforms strong existing baselines by a sizable margin, improving CEFR prediction accuracy by more than 10%.
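The two strategies named above can be illustrated with a minimal, self-contained sketch. The inverse-frequency weighting and nearest-prototype classifier below are common instantiations of loss reweighting and metric-based classification, not the paper's exact formulation; all function names, the toy label counts, and the 2-D "embeddings" are illustrative assumptions.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency.

    A standard loss-reweighting scheme for imbalanced data: scaling each
    example's loss by its class weight gives every class equal total
    influence. Illustrative only -- the paper's exact scheme is not
    specified in the abstract.
    """
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

def mean_vec(vectors):
    """Elementwise mean of equal-length vectors (a class prototype)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def nearest_prototype(query, prototypes):
    """Metric-based classification: assign the label of the closest
    class prototype under squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda c: sqdist(query, prototypes[c]))

# Toy, skewed CEFR-level label distribution (hypothetical counts).
labels = ["A2"] * 50 + ["B1"] * 30 + ["B2"] * 15 + ["C1"] * 5
w = inverse_frequency_weights(labels)  # the rare C1 class gets the largest weight

# Toy 2-D "SSL embeddings" per level; real SSL features are high-dimensional.
embeddings = {"A2": [[0.0, 0.0], [0.0, 2.0]], "B1": [[4.0, 4.0], [6.0, 4.0]]}
prototypes = {c: mean_vec(vs) for c, vs in embeddings.items()}
pred = nearest_prototype([1.0, 1.0], prototypes)
```

Reweighting of this kind counteracts the uneven proficiency-level distribution, while prototype-style metric classification sidesteps the need for a large labeled set per level by comparing embeddings directly in the learned space.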