Despite rapid progress in increasing the language coverage of automatic speech recognition, the field is still far from covering all languages with a known writing script. Recent work showed promising results with a zero-shot approach requiring only a small amount of text data. However, its accuracy depends heavily on the quality of the phonemizer used, which is often weak for unseen languages. In this paper, we present MMS Zero-shot, a conceptually simpler approach based on romanization and an acoustic model trained on data in 1,078 different languages, or three orders of magnitude more than prior art. MMS Zero-shot reduces the average character error rate by a relative 46% over 100 unseen languages compared to the best previous work. Moreover, the error rate of our approach is only 2.5x higher than that of in-domain supervised baselines, even though our approach uses no labeled data at all for the evaluation languages.
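To make the reported metrics concrete, the following is a minimal sketch of how a character error rate (CER) and a relative reduction between two systems are typically computed. The function names and the numeric values in the usage note are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` lowers the error rate relative to `baseline`."""
    return (baseline - new) / baseline
```

As a hypothetical example, a baseline CER of 0.50 and a new CER of 0.27 give `relative_reduction(0.50, 0.27)` of about 0.46, which is what a relative 46% reduction means. The 2.5x comparison is the plain ratio of two error rates.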