Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although these approaches achieve state-of-the-art (SOTA) performance, most of them rely on an autoregressive (AR) decoder that generates tokens one by one, which results in a large real-time factor (RTF). To speed up inference, we adopt Paraformer, a recently proposed non-autoregressive model, as the acoustic model in the SA-ASR system. Paraformer uses a single-step decoder to enable parallel generation and obtains performance comparable to SOTA AR Transformer models. In addition, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to strengthen the encoder's acoustic modeling ability. Experiments on the AliMeeting corpus show that our model achieves a 6.1% relative reduction in speaker-dependent character error rate (SD-CER) over the cascaded SA-ASR model on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% at only 1/10 of the RTF of the SOTA joint AR SA-ASR model.
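For readers unfamiliar with the two headline quantities, the following is a minimal sketch of how a relative error-rate reduction and a real-time factor are conventionally computed. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not the paper's baseline figures or timings.

```python
def relative_reduction(baseline_err: float, system_err: float) -> float:
    """Relative error-rate reduction of a system over a baseline, as a fraction.

    Both arguments are error rates on the same scale (e.g. percent SD-CER).
    """
    if baseline_err <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_err - system_err) / baseline_err


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent decoding divided by the duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Made-up numbers, NOT taken from the paper.
    print(f"relative reduction: {relative_reduction(40.0, 36.0):.1%}")
    print(f"RTF: {real_time_factor(6.0, 60.0):.2f}")
```

Under these definitions, a model running at 1/10 of another model's RTF needs one tenth of the decoding time for the same audio.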