Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure, so a single transducer model can in principle perform both tasks. In real-world applications, such a joint ASR and ST model may need to be streaming and to work without source language identification (i.e., be language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Building on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also reaches the performance of monolingual ASR and bilingual ST models.