The Streaming Unmixing and Recognition Transducer (SURT) model was recently proposed as an end-to-end approach for continuous, streaming, multi-talker automatic speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage- and omission-related errors; (ii) it is computationally expensive and has therefore not seen adoption in academia; and (iii) it has only been evaluated on synthetic mixtures. In this work, we propose several modifications to the original SURT that are designed to address these limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v) use auxiliary objectives in the form of masking loss and encoder CTC loss, and (vi) perform domain adaptation for far-field recognition. We show that these modifications allow SURT 2.0 to outperform its predecessor on multi-talker ASR while being efficient enough to train with academic resources. We conduct our evaluations on three publicly available meeting benchmarks (LibriCSS, AMI, and ICSI), where our best model achieves WERs of 16.9%, 44.6%, and 32.2%, respectively, on far-field unsegmented recordings. We release training recipes and pre-trained models: https://sites.google.com/view/surt2.
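For readers unfamiliar with the metric, the WER figures quoted above follow the usual definition: the word-level edit distance (substitutions, deletions, and insertions) between reference and hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition only, assuming whitespace tokenization and a single reference transcript. The function name `word_error_rate` is illustrative and is not taken from the released recipes. Multi-talker scoring additionally has to decide how reference utterances are assigned to output channels, and that step is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the meeting starts at noon"
    hyp = "the meeting start at noon today"
    # One substitution and one insertion against five reference words.
    print(f"WER = {word_error_rate(ref, hyp):.1%}")  # WER = 40.0%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is relevant to leakage-type errors where words appear in the wrong output channel.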