Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for broader automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented either with architectures that use multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, mainly because the encoder must be run multiple times. The second involves a complex label generation process that requires accurate timestamps for every word spoken by every speaker in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by prepending to each transcription a prompt token that identifies the corresponding speaker, following the order in which the speakers appear in the mixture. Thus, MT-RNNT-AFT can be trained without accurate alignments, and it recognizes all speakers' speech with a single encoder pass. Experiments show that MT-RNNT-AFT achieves performance comparable to state-of-the-art alternatives while greatly simplifying the training process.
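The label construction described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact implementation: the prompt-token names (`<spk1>`, `<spk2>`, …), the function name, and the onset-sorted input format are all assumptions made for illustration.

```python
def make_alignment_free_targets(utterances):
    """Build per-speaker RNNT target labels without word-level alignments.

    `utterances` is a list of (onset_time, token_list) pairs, one per
    speaker in the mixture. Only each speaker's speech onset (appearance
    order) is needed, not per-word timestamps from an external ASR system.
    Prompt tokens like "<spk1>" are hypothetical names for illustration.
    """
    # Order speakers by when they first speak in the mixture.
    ordered = sorted(utterances, key=lambda u: u[0])
    # Prepend a speaker-order prompt token to each transcription.
    return [[f"<spk{i + 1}>"] + tokens
            for i, (_, tokens) in enumerate(ordered)]
```

For example, if speaker B starts at 0.2 s with "hello" and speaker A at 1.4 s with "good morning", the targets become `[["<spk1>", "hello"], ["<spk2>", "good", "morning"]]` — the prompt reflects appearance order only, so no word-level timestamps are required.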