The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, SURT has been shown to be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework further by proposing methods to perform speaker-attributed transcription with SURT, for both short mixtures and long recordings. We achieve this by adding an auxiliary speaker branch to SURT and synchronizing its label prediction with ASR token prediction through blank factorization in the style of the hybrid autoregressive transducer (HAT). To keep relative speaker labels consistent across different utterance groups in a recording, we propose "speaker prefixing": prefixing each chunk with high-confidence frames of speakers identified in previous chunks, to establish the relative order. We perform extensive ablation experiments on synthetic LibriSpeech mixtures to validate our design choices, and demonstrate the efficacy of our final model on the AMI corpus.
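To illustrate how HAT-style blank factorization can synchronize two output branches, the following is a minimal sketch and not the implementation used in this work. It assumes that the blank probability is a sigmoid of a single blank logit, that the remaining probability mass is distributed over non-blank labels with a softmax, and that the speaker branch reuses the blank logit of the ASR branch. The function names (`hat_factorize`, `joint_step`) and the convention that index 0 of `asr_logits` holds the blank logit are hypothetical choices made for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hat_factorize(blank_logit, label_logits):
    """Return [P(blank), P(label_1), ..., P(label_K)].

    The blank probability is modeled separately with a sigmoid; the
    non-blank labels share the remaining mass (1 - P(blank)).
    """
    p_blank = sigmoid(blank_logit)
    label_probs = softmax(label_logits)
    return [p_blank] + [(1.0 - p_blank) * p for p in label_probs]


def joint_step(asr_logits, speaker_logits):
    """One output step for both branches (illustrative assumption).

    asr_logits[0] is taken to be the blank logit. The speaker branch
    reuses it, so both branches assign the same probability to blank
    and therefore emit non-blank labels at the same steps.
    """
    blank_logit = asr_logits[0]
    asr_dist = hat_factorize(blank_logit, asr_logits[1:])
    spk_dist = hat_factorize(blank_logit, speaker_logits)
    return asr_dist, spk_dist


# Example: three ASR tokens and two relative speaker labels.
asr_dist, spk_dist = joint_step([0.0, 1.0, 2.0, 0.5], [0.3, -0.2])
```

Because both distributions share one blank probability, a speaker label receives non-blank mass exactly when an ASR token does, which is the sense in which the two predictions are synchronized in this sketch.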