Awareness of biased ASR datasets and models has increased notably in recent years. Even for English, despite a vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) that transforms native US-English speech into accented pronunciation. We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform. Furthermore, we investigate the feasibility of learned accent representations instead of static embeddings. The generated data was then used to train two state-of-the-art ASR systems. We evaluated our approach on native and non-native English datasets and found that synthetically accented data helped the ASR systems better recognize speech from seen accents. This improvement did not carry over to unseen accents, and it was not observed for a model that had been pre-trained exclusively with native speech.