In real-world applications, users often need both a transcription and a translation of speech to aid comprehension, particularly in streaming scenarios where output must be generated incrementally. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1 s and ST latency of 1.3 s, our model matches or even improves on the output quality of separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
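To make the idea of interleaving source and target words concrete, the following is a minimal sketch of how a serialized training target could be built from word-level alignments. It is an illustration under stated assumptions, not the authors' implementation: the function names `interleave` and `serialize`, the `<asr>` and `<st>` tags, and the release rule (a target word is emitted once the last source word aligned to it has appeared, while target word order is preserved) are all hypothetical. The alignment is assumed to be a list of `(source_index, target_index)` pairs such as a textual aligner might produce.

```python
from typing import Dict, List, Tuple

# Hypothetical language tags marking a switch between ASR and ST output.
ASR_TAG = "<asr>"
ST_TAG = "<st>"


def interleave(
    src: List[str],
    tgt: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Interleave source and target words using word alignments.

    Assumed rule: a target word is released after the last source word
    aligned to it, and never before an earlier target word.
    """
    # Latest source index aligned to each target index.
    last_src: Dict[int, int] = {}
    for s, t in alignment:
        last_src[t] = max(last_src.get(t, -1), s)

    # Release position per target word, forced to be non-decreasing so the
    # target word order is kept; unaligned words inherit the previous one.
    release: List[int] = []
    prev = -1
    for t in range(len(tgt)):
        prev = max(prev, last_src.get(t, prev), 0)
        release.append(prev)

    out: List[Tuple[str, str]] = []
    t = 0
    for s, word in enumerate(src):
        out.append((ASR_TAG, word))
        # Emit every target word whose release position has been reached.
        while t < len(tgt) and release[t] <= s:
            out.append((ST_TAG, tgt[t]))
            t += 1
    # Flush any target words left over (for example, when src is empty).
    out.extend((ST_TAG, w) for w in tgt[t:])
    return out


def serialize(pairs: List[Tuple[str, str]]) -> str:
    """Flatten tagged words into one string, adding a tag on each switch."""
    tokens: List[str] = []
    current = None
    for tag, word in pairs:
        if tag != current:
            tokens.append(tag)
            current = tag
        tokens.append(word)
    return " ".join(tokens)


if __name__ == "__main__":
    pairs = interleave(
        ["la", "casa", "rossa"],
        ["the", "red", "house"],
        [(0, 0), (2, 1), (1, 2)],
    )
    print(serialize(pairs))
```

In this toy Italian-English example, "red" is aligned to the final source word "rossa", so both "red" and "house" are held back until "rossa" has been emitted. This shows how word reordering between languages can delay the translation relative to the transcription.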