Automatic Pronunciation Assessment (APA) is vital for computer-assisted language learning. Prior methods rely either on annotated speech-text data to train Automatic Speech Recognition (ASR) models or on speech-score data to train regression models. In this work, we propose a novel zero-shot APA method based on the pre-trained acoustic model HuBERT. Our method encodes the speech input and corrupts it via a masking module. We then employ a Transformer encoder and apply k-means clustering to obtain token sequences. Finally, a scoring module measures the number of wrongly recovered tokens. Experimental results on speechocean762 demonstrate that the proposed method achieves performance comparable to supervised regression baselines and outperforms non-regression baselines in terms of Pearson Correlation Coefficient (PCC). Additionally, we analyze how masking strategies affect APA performance.
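The mask, recover, and count steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the zero-valued mask, the `encoder` callable (a stand-in for the HuBERT Transformer encoder), and the choice to take reference tokens from the uncorrupted input are all assumptions made for illustration.

```python
import numpy as np


def assign_tokens(features, centroids):
    """Map each frame (row) to the index of its nearest k-means centroid."""
    # Pairwise squared Euclidean distances, shape (frames, clusters).
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def mask_frames(features, mask, mask_value=0.0):
    """Return a copy of `features` with masked frames set to `mask_value`."""
    corrupted = features.copy()
    corrupted[mask] = mask_value
    return corrupted


def count_wrongly_recovered(features, centroids, mask, encoder):
    """Count masked frames whose token differs after corruption.

    Assumption: reference tokens come from the uncorrupted input, and
    `encoder` is any callable mapping (frames, dim) -> (frames, dim).
    """
    reference = assign_tokens(encoder(features), centroids)
    recovered = assign_tokens(encoder(mask_frames(features, mask)), centroids)
    return int((reference[mask] != recovered[mask]).sum())
```

Under these assumptions, a lower count means the model could reconstruct more of the masked tokens from context. How the count is normalized or mapped to a pronunciation score is not specified in the abstract, so it is left out here.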