There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for information sharing between different languages. This paper explores pretraining with weak phonetic supervision towards data-efficient MCL-ASR; the approach is named Whistle. We relax the requirement of gold-standard, human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. The experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition on seen languages, crosslingual performance on unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that, when training data is more limited, phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research along this direction, we release the code, models and data for the entire pipeline of Whistle at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.
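To make the weak-supervision step concrete, the sketch below shows how graphemic transcripts can be mapped to IPA phoneme sequences by a G2P rule table. This is a minimal, hypothetical illustration only: the toy Spanish rule set and the greedy longest-match lookup are assumptions for demonstration, not the actual LanguageNet G2P models (which are FST-based) used in the Whistle pipeline.

```python
# Hypothetical sketch of weak phonetic supervision: turning graphemic
# transcripts into IPA phoneme labels via a G2P mapping. The rule table
# below is a toy example, NOT the LanguageNet G2P models used by Whistle.

# Toy per-language grapheme-to-IPA rules (illustrative assumption).
G2P_RULES = {
    "es": {
        "ch": "tʃ", "ll": "ʎ",
        "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
        "b": "b", "k": "k", "l": "l", "m": "m", "n": "n",
        "r": "ɾ", "s": "s", "t": "t",
    },
}

def g2p(word: str, lang: str) -> list[str]:
    """Greedy longest-match G2P: prefer multi-letter graphemes (e.g. 'ch')."""
    rules = G2P_RULES[lang]
    max_len = max(len(g) for g in rules)
    phones, i = [], 0
    while i < len(word):
        for span in range(max_len, 0, -1):  # try the longest grapheme first
            chunk = word[i:i + span]
            if chunk in rules:
                phones.append(rules[chunk])
                i += span
                break
        else:
            i += 1  # no rule for this character: skip (weak supervision tolerates noise)
    return phones

print(g2p("chile", "es"))  # → ['tʃ', 'i', 'l', 'e']
```

Because such automatically derived IPA labels are noisy compared to human-validated phonetic transcripts, the supervision is "weak"; the shared IPA inventory is what enables information sharing across languages during pretraining.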