Building ASR systems that are robust to foreign-accented speech is an important challenge in today's globalized world. A prior study explored how to improve phonetic token-based ASR on accented speech by reproducing the interlanguage speech intelligibility benefit (ISIB), a phenomenon in which foreign-accented speech is more intelligible to listeners who share the speaker's native language (L1) than to native listeners. That study implemented ISIB by using the speaker's L1 to learn the k-means cluster centroids in a self-supervised learning (SSL) feature space, from which the phonetic tokens are obtained. In this study, we propose a more advanced modeling of ISIB. By employing differentiable k-means and optimizing the entire module for both L1 and L2 ASR, the proposed method outperforms the baselines, both when using only native speech and when additionally incorporating a limited amount of accented speech. Notably, in the latter scenario, our method achieves a relative improvement of approximately 20% in recognition accuracy.
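The abstract does not spell out how the k-means step is made differentiable. As a rough illustration only, one common relaxation replaces the hard nearest-centroid assignment with a softmax over negative squared distances, so that a downstream ASR loss could in principle pass gradients back to the centroids. The sketch below contrasts the two assignments on toy 2-D vectors standing in for SSL features. The function names, the `temperature` parameter, and the toy data are all assumptions made for this example, not the authors' implementation.

```python
import math

def squared_distances(frame, centroids):
    """Squared Euclidean distance from one feature frame to each centroid."""
    return [sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids]

def hard_token(frame, centroids):
    """Standard k-means assignment: index of the nearest centroid.
    The argmin is piecewise constant, so it provides no useful gradient."""
    d = squared_distances(frame, centroids)
    return min(range(len(d)), key=d.__getitem__)

def soft_assignment(frame, centroids, temperature=1.0):
    """Assumed relaxation: softmax over negative squared distances.
    Lower temperatures push the weights toward the hard assignment."""
    d = squared_distances(frame, centroids)
    logits = [-dist / temperature for dist in d]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for SSL features (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frame = [0.9, 0.8]

token = hard_token(frame, centroids)
weights = soft_assignment(frame, centroids, temperature=0.5)
print("hard token:", token)
print("soft weights:", [round(w, 4) for w in weights])
```

In a trainable system the same computation would be written in an automatic-differentiation framework, with the centroids as learnable parameters updated jointly with the ASR model. The pure-Python version here only shows the shape of the relaxation.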