Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. Given how many languages these models must handle, however, a key to understanding their imbalanced performance across languages is to examine whether the model actually knows which language it should transcribe. In this paper, we present our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID). We investigate techniques inspired by recent Connectionist Temporal Classification (CTC) studies to help the model handle the large number of languages, conditioning on the LID predictions of auxiliary tasks. Our experimental results demonstrate the effectiveness of our technique over standard CTC/Attention-based hybrid models. Furthermore, our state-of-the-art systems using self-supervised models with the Conformer architecture reduce the character error rate (CER) of prior work on FLEURS by a relative 28.4%. Trained models and reproducible recipes are available at https://github.com/espnet/espnet/tree/master/egs2/fleurs/asr1 .
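To make the "relative" in the reported 28.4% concrete, the following minimal sketch shows how a relative CER reduction is conventionally computed. The function name and the numeric values are hypothetical and chosen only for easy arithmetic; they are not results from this work.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the CER reduction as a percentage of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical CER values (in percent), not taken from the paper.
baseline = 10.0
improved = 7.5
reduction = relative_cer_reduction(baseline, improved)
print(f"relative CER reduction: {reduction:.1f}%")
```

Under this convention, a relative reduction is measured against the baseline error rate, so it is not the same as an absolute drop in CER points.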