Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech in which two or more languages alternate within a conversation. This paper proposes (1) a new method for creating code-switching ASR datasets from purely monolingual data sources, and (2) a novel Concatenated Tokenizer that enables ASR models to generate a language ID for each emitted text token while reusing existing monolingual tokenizers. The efficacy of these approaches for building CS ASR models is demonstrated for two language pairs, English-Hindi and English-Spanish, where we achieve new state-of-the-art results on the Miami Bangor CS evaluation corpus. In addition to competitive ASR performance, the proposed Concatenated Tokenizer models are highly effective for spoken language identification, achieving over 98% accuracy on the out-of-distribution FLEURS dataset.
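To illustrate how a tokenizer built from existing monolingual tokenizers can attach a language ID to every emitted token, the following is a minimal sketch and not the paper's implementation. It assumes that each monolingual vocabulary is mapped to its own disjoint range of token IDs, so that the language of a token can be read off its ID. `ToyTokenizer` and `ConcatenatedTokenizerSketch` are hypothetical names, and the whitespace word-level tokenizer is a stand-in for a real subword tokenizer.

```python
from typing import Dict, List, Tuple


class ToyTokenizer:
    """Hypothetical stand-in for a monolingual tokenizer (whitespace, word-level)."""

    def __init__(self, vocab: List[str]) -> None:
        self.vocab = list(vocab)
        self.index = {tok: i for i, tok in enumerate(self.vocab)}

    def encode(self, text: str) -> List[int]:
        # Raises KeyError on out-of-vocabulary words; a real tokenizer would not.
        return [self.index[word] for word in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


class ConcatenatedTokenizerSketch:
    """Assumed design: give each monolingual vocabulary a disjoint ID range."""

    def __init__(self, tokenizers: Dict[str, ToyTokenizer]) -> None:
        self.tokenizers = tokenizers
        self.offsets: Dict[str, int] = {}
        self.ranges: List[Tuple[int, int, str]] = []  # (start, end, language)
        start = 0
        for lang, tok in tokenizers.items():
            end = start + len(tok.vocab)
            self.offsets[lang] = start
            self.ranges.append((start, end, lang))
            start = end
        self.vocab_size = start  # sum of the monolingual vocabulary sizes

    def encode(self, text: str, lang: str) -> List[int]:
        # Shift monolingual IDs into the range reserved for this language.
        offset = self.offsets[lang]
        return [offset + i for i in self.tokenizers[lang].encode(text)]

    def lang_of(self, token_id: int) -> str:
        # The language ID is recovered from the ID range alone.
        for start, end, lang in self.ranges:
            if start <= token_id < end:
                return lang
        raise ValueError(f"token id {token_id} is outside every vocabulary range")

    def decode(self, ids: List[int]) -> str:
        pieces = []
        for token_id in ids:
            lang = self.lang_of(token_id)
            local_id = token_id - self.offsets[lang]
            pieces.append(self.tokenizers[lang].decode([local_id]))
        return " ".join(pieces)


# Toy vocabularies, invented for illustration only.
en = ToyTokenizer(["hello", "world", "the"])
es = ToyTokenizer(["hola", "mundo"])
cat_tok = ConcatenatedTokenizerSketch({"en": en, "es": es})

# A code-switched token sequence built from two monolingual segments.
ids = cat_tok.encode("hello world", "en") + cat_tok.encode("hola mundo", "es")
token_langs = [cat_tok.lang_of(i) for i in ids]
```

Under this assumption the monolingual tokenizers are reused unchanged, the combined output vocabulary is the sum of the individual vocabularies, and an utterance-level language label could be derived from the per-token labels, for example by majority vote.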