Growing global communication and cross-lingual interaction are driving demand for real-time transcription and translation of spoken language, making multilingual translation essential for user-facing applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, which uses computational resources inefficiently and increases synchronization complexity in real-time settings. In this paper, we propose a streaming Transformer-Transducer (T-T) model that jointly produces many-to-one and one-to-many transcriptions and translations using a single decoder. We introduce a novel method for joint token-level serialized output training that uses timestamp information to produce ASR and ST outputs effectively in the streaming setting. Experiments on {it,es,de}->en demonstrate the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
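To make the idea of timestamp-based token-level serialization more concrete, the following is a minimal sketch of one way ASR and ST tokens could be interleaved into a single training target. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function name `serialize_by_timestamp`, the task tags `<asr>` and `<st>`, the per-token timestamps, and the tie-breaking rule are all hypothetical and should not be read as the authors' exact method.

```python
from typing import List, Optional, Tuple

# A token paired with a (hypothetical) emission time in seconds.
Token = Tuple[str, float]


def serialize_by_timestamp(
    asr: List[Token],
    st: List[Token],
    asr_tag: str = "<asr>",  # hypothetical task tag, not taken from the paper
    st_tag: str = "<st>",    # hypothetical task tag, not taken from the paper
) -> List[str]:
    """Merge two token streams into one sequence ordered by timestamp.

    The order inside each stream is preserved. A task tag is emitted
    whenever the output switches from one stream to the other. On a
    timestamp tie, the ASR token is emitted first (an assumption).
    """
    out: List[str] = []
    current: Optional[str] = None
    i = j = 0
    while i < len(asr) or j < len(st):
        # Take from ASR if ST is exhausted, or if the next ASR token is not later.
        take_asr = j >= len(st) or (i < len(asr) and asr[i][1] <= st[j][1])
        if take_asr:
            tag, tok = asr_tag, asr[i][0]
            i += 1
        else:
            tag, tok = st_tag, st[j][0]
            j += 1
        if tag != current:
            out.append(tag)
            current = tag
        out.append(tok)
    return out


if __name__ == "__main__":
    # Toy it->en example with made-up timestamps.
    asr_tokens = [("ciao", 0.3), ("mondo", 0.8)]
    st_tokens = [("hello", 0.5), ("world", 1.1)]
    print(serialize_by_timestamp(asr_tokens, st_tokens))
```

In this toy setup, a single decoder trained on such a sequence would emit transcription and translation tokens in one stream, with the tags indicating which output each token belongs to. How the timestamps are obtained and how task switches are marked in practice are left open here.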