Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure, so a single transducer model can perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and language-agnostic, i.e., they should not require source language identification. In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Based on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also reaches the performance of monolingual ASR and bilingual ST models.
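
For readers unfamiliar with the transducer structure mentioned above, the following is a minimal sketch of a generic transducer joint network, which combines every encoder frame with every prediction-network state to produce logits over an output vocabulary. It is not LAMASSU's implementation: the function names, the additive tanh combination, the layer sizes, and the idea of one vocabulary shared across output languages are all assumptions made for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length D) by matrix w (D rows, H columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def joint_network(enc, pred, w_enc, w_pred, w_out):
    """Generic transducer joint (hypothetical helper, not from the paper).

    enc:  T encoder frames, each a vector of length D_enc
    pred: U prediction-network states, each a vector of length D_pred
    Returns logits indexed as [t][u][v], where v ranges over an output
    vocabulary assumed here to be shared by all languages (blank included).
    """
    h_enc = [linear(frame, w_enc) for frame in enc]      # T x H
    h_pred = [linear(state, w_pred) for state in pred]   # U x H
    return [
        [
            # Add the two projections, apply tanh, then project to the vocabulary.
            linear([math.tanh(a + b) for a, b in zip(he, hp)], w_out)
            for hp in h_pred
        ]
        for he in h_enc
    ]


def random_matrix(rows, cols, rng):
    """Random weights or inputs, standing in for trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, U, D_ENC, D_PRED, H, V = 5, 3, 8, 6, 4, 10  # illustrative sizes only

logits = joint_network(
    random_matrix(T, D_ENC, rng),    # encoder outputs
    random_matrix(U, D_PRED, rng),   # prediction-network outputs
    random_matrix(D_ENC, H, rng),
    random_matrix(D_PRED, H, rng),
    random_matrix(H, V, rng),
)
print(len(logits), len(logits[0]), len(logits[0][0]))
```

In a setup like this, the same joint and prediction weights serve both transcription and translation outputs, so adding an output language only enlarges the vocabulary projection rather than adding a separate model.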