In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate speech in multiple languages into text in the target language. The backbone of SM2 is the Transformer Transducer, which is well suited to streaming. Instead of human-labeled speech translation (ST) data, SM2 models are trained on weakly supervised data generated by passing the transcriptions in speech recognition corpora through a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve ST quality comparable to, or even better than, that of some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has a truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during training.
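The weak-supervision step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `AsrExample`, `StExample`, and `build_weak_st_pairs`, and the `translate` callback signature, are hypothetical and stand in for whatever corpus format and machine translation service are actually used. The sketch also assumes that when the source and target languages match, the original transcription is kept as the label.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrExample:
    """One utterance from a speech recognition corpus (hypothetical layout)."""
    audio_path: str
    transcript: str    # human transcription in the source language
    source_lang: str


@dataclass(frozen=True)
class StExample:
    """One weakly labeled speech translation pair (hypothetical layout)."""
    audio_path: str
    target_text: str   # machine-generated label, not human-verified
    source_lang: str
    target_lang: str


# Assumed callback shape: translate(text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def build_weak_st_pairs(
    asr_corpus: Iterable[AsrExample],
    translate: Translator,
    target_lang: str,
) -> List[StExample]:
    """Turn ASR examples into {source-speech, target-text} training pairs."""
    pairs = []
    for ex in asr_corpus:
        if ex.source_lang == target_lang:
            # Assumption: same-language pairs keep the transcription as-is.
            target_text = ex.transcript
        else:
            # The translated transcription serves as a weak label.
            target_text = translate(ex.transcript, ex.source_lang, target_lang)
        pairs.append(
            StExample(ex.audio_path, target_text, ex.source_lang, target_lang)
        )
    return pairs
```

Passing the translator in as a callback keeps the sketch independent of any particular translation service; in practice it would wrap a call to that service.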