Multilingual training is effective in improving low-resource ASR, which may be partially explained by the sharing of phonetic representations between languages. In end-to-end (E2E) ASR systems, graphemes are often used as the basic modeling units; however, graphemes may not be ideal for multilingual phonetic sharing. In this paper, we leverage a language-universal phonetic model based on the International Phonetic Alphabet (IPA) to improve low-resource ASR performance, for the first time within the attention encoder-decoder architecture. We also propose an adaptation method for the IPA-based phonetic model that further improves the approach on extremely low-resource languages. Experiments on the open-source MLS corpus and our internal databases show that our approach outperforms baseline monolingual models and most state-of-the-art works. Both the main approach and the adaptation are effective on extremely low-resource languages, even in domain- and language-mismatched scenarios.