Recent work has shown that it is possible to train an $\textit{unsupervised}$ automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even when no labeled audio exists for a given language, labeled data is $\textit{always}$ available for other languages. We show that character-level acoustic models (AMs) from other languages can be used to bootstrap an $\textit{unsupervised}$ AM in a new language. Here, "unsupervised" means that no labeled audio is available for the $\textit{target}$ language. Our approach rests on two key ingredients: (i) generating pseudo-labels (PLs) for the $\textit{target}$ language using an AM trained on some $\textit{other}$ language and (ii) constraining these PLs with a $\textit{target language model}$. Our approach is effective on Common Voice: e.g., transferring an English AM to Swahili achieves 18% word error rate (WER). It also outperforms character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech, while using 800 hours of labeled German data instead of 60k hours of unlabeled English data.
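To make the two ingredients concrete, the following is a minimal sketch, not the method used in this work. It assumes the other-language AM has already produced a short list of candidate character transcriptions with log-scores, and it stands in for the target language model with a toy add-k smoothed character bigram model. The function names, the `lm_weight` parameter, and the example strings are all hypothetical and chosen only for illustration; a real system would more likely apply the language model inside beam-search decoding.

```python
import math
from collections import Counter


def train_char_bigram_lm(corpus, smoothing=1.0):
    """Fit a toy add-k smoothed character bigram LM on target-language text.

    Returns a function mapping a string to its log-probability.
    (Illustrative stand-in for a real target language model.)
    """
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    vocab_size = len(vocab)

    def log_prob(text):
        chars = ["<s>"] + list(text) + ["</s>"]
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = bigrams[(prev, cur)] + smoothing
            den = unigrams[prev] + smoothing * vocab_size
            total += math.log(num / den)
        return total

    return log_prob


def select_pseudo_label(hypotheses, lm_log_prob, lm_weight=0.5):
    """Pick the hypothesis maximizing am_score + lm_weight * lm_score.

    `hypotheses` is a list of (text, am_log_score) pairs, assumed to come
    from an acoustic model trained on some other language.
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]))
    return best[0]


# Hypothetical usage: unpaired target-language text and two AM candidates.
target_text = ["habari yako", "habari za asubuhi", "asante sana"]
lm = train_char_bigram_lm(target_text)
candidates = [("habary yaco", -1.0), ("habari yako", -1.2)]
pseudo_label = select_pseudo_label(candidates, lm, lm_weight=0.5)
```

In this toy setting the acoustic score alone slightly prefers the misspelled candidate, while adding the language model term shifts the choice to the candidate whose character sequences appear in the target-language text. Setting `lm_weight=0.0` recovers the unconstrained pseudo-label.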