Improving Massively Multilingual ASR With Auxiliary CTC Objectives

Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. Because these models must handle so many languages, however, a key to understanding their imbalanced performance across languages is to examine whether the model actually knows which language it should transcribe. In this paper, we introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID). We investigate techniques inspired by recent Connectionist Temporal Classification (CTC) studies to help the model handle the large number of languages, conditioning on the LID predictions of auxiliary tasks. Our experimental results demonstrate the effectiveness of our technique over standard CTC/Attention-based hybrid models. Furthermore, our state-of-the-art systems, which use self-supervised models with the Conformer architecture, improve over the results of prior work on FLEURS by 28.4% relative CER. Trained models and reproducible recipes are available at https://github.com/espnet/espnet/tree/master/egs2/fleurs/asr1.
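As a rough illustration of the two quantities the abstract relies on, the sketch below shows one common way to fold auxiliary CTC losses into a hybrid CTC/Attention objective, and how a relative CER reduction is computed. The abstract does not give the exact formulation, so the function names, the interpolation scheme, and the default weights (`ctc_weight`, `aux_weight`) are assumptions made for this example and are not values taken from the paper.

```python
from statistics import mean


def hybrid_loss(att_loss, final_ctc_loss, aux_ctc_losses,
                ctc_weight=0.3, aux_weight=0.3):
    """Combine attention, final-layer CTC, and auxiliary CTC losses.

    All inputs are plain floats (already-computed loss values).
    The weights are hypothetical defaults chosen for illustration.
    """
    if aux_ctc_losses:
        # Interpolate the final CTC loss with the mean of the auxiliary
        # (e.g. intermediate-layer, LID-predicting) CTC losses.
        ctc_loss = ((1.0 - aux_weight) * final_ctc_loss
                    + aux_weight * mean(aux_ctc_losses))
    else:
        ctc_loss = final_ctc_loss
    # Standard hybrid interpolation between the CTC and attention branches.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def relative_reduction(baseline_cer, new_cer):
    """Relative error reduction: the share of the baseline error removed."""
    return (baseline_cer - new_cer) / baseline_cer
```

Read this way, a 28.4% relative improvement means the new CER is 71.6% of the prior CER, not that the CER fell by 28.4 absolute points. For instance, `relative_reduction(10.0, 7.16)` describes a made-up baseline of 10.0% CER dropping to 7.16%; these numbers are illustrative only and do not come from the paper.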

