Multilingual Automatic Speech Recognition (ASR) models can transcribe audio in multiple languages, eliminating the need for a separate model for each language. They can also perform Language Identification (LID) and handle code-switched speech. However, training these models requires dedicated code-switched and multilingual speech corpora, which are scarce. In this paper, we evaluate different approaches to training bilingual and code-switched ASR models using purely monolingual data sources. We introduce the concept of aggregate tokenizers, which, unlike the currently prevalent technique of generating LIDs at the boundaries of monolingual samples, produce an LID for each emitted token. We compare bilingual and monolingual model performance, showcase the efficacy of aggregate tokenizers, present a technique for generating synthetic code-switched ASR data, and demonstrate the effectiveness of the proposed code-switched ASR models on speech recognition and spoken language identification.
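To make the idea of a per-token LID concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes one possible mechanism that the abstract does not spell out: each monolingual tokenizer is given a disjoint range of token IDs in a shared vocabulary, so the language of any emitted token can be read off from its ID. The classes `ToyWordTokenizer` and `AggregateTokenizer`, their methods, and the toy vocabularies are all hypothetical names introduced for illustration; a real system would wrap subword tokenizers rather than whitespace-split words.

```python
class ToyWordTokenizer:
    """Hypothetical stand-in for a monolingual tokenizer (word-level for brevity)."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.index = {token: i for i, token in enumerate(self.vocab)}

    def encode(self, text):
        # Map each whitespace-separated word to its local ID.
        return [self.index[word] for word in text.split()]


class AggregateTokenizer:
    """Assumed design: concatenate monolingual vocabularies into disjoint ID ranges."""

    def __init__(self, tokenizers):
        # tokenizers: dict of language code -> monolingual tokenizer.
        # Insertion order determines which ID range each language receives.
        self.tokenizers = dict(tokenizers)
        self.offsets = {}
        offset = 0
        for lang, tokenizer in self.tokenizers.items():
            self.offsets[lang] = offset
            offset += len(tokenizer.vocab)
        self.vocab_size = offset

    def encode(self, text, lang):
        # Shift local IDs into the range reserved for this language.
        offset = self.offsets[lang]
        return [offset + i for i in self.tokenizers[lang].encode(text)]

    def lang_of(self, token_id):
        # The ID range alone identifies the language of a token.
        for lang, offset in self.offsets.items():
            if offset <= token_id < offset + len(self.tokenizers[lang].vocab):
                return lang
        raise ValueError(f"token id {token_id} is outside the aggregate vocabulary")

    def decode_with_lid(self, ids):
        # Return (token, language) pairs, i.e. an LID for every emitted token.
        pairs = []
        for token_id in ids:
            lang = self.lang_of(token_id)
            local_id = token_id - self.offsets[lang]
            pairs.append((self.tokenizers[lang].vocab[local_id], lang))
        return pairs


# Toy vocabularies (illustrative only).
agg = AggregateTokenizer({
    "en": ToyWordTokenizer(["hello", "world", "the"]),
    "es": ToyWordTokenizer(["hola", "mundo"]),
})

english_ids = agg.encode("hello world", "en")
spanish_ids = agg.encode("hola mundo", "es")

# A code-switched hypothesis mixing tokens from both ID ranges.
mixed = agg.decode_with_lid([english_ids[0], spanish_ids[1]])
print(mixed)
```

Under this assumption, a model trained on the shared vocabulary needs no separate LID output: the language of each token, and hence the points at which a speaker switches languages, follows directly from the predicted token IDs.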