The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating the transcriptions of all speakers as a single sequence, separated by a special token. However, frequent speaker changes make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. In addition to the standard character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared with the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and initializing the BA-SOT model from a pre-trained ASR model further reduces CER/UD-CER by 8.4%/19.9%.
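As a rough illustration of the serialized target that SOT-style training relies on, the sketch below joins the transcriptions of overlapping utterances into one label sequence separated by a speaker-change token. The token spelling `<sc>`, the ordering by start time, and the helper names `serialize_sot` and `split_sot` are assumptions made for this example; the abstract does not specify them, and the sketch does not cover the BA-SOT boundary losses or the UD-CER metric.

```python
# Minimal sketch of SOT-style label serialization (illustrative only).
# SC, serialize_sot, and split_sot are hypothetical names for this example.

SC = "<sc>"  # assumed spelling of the special speaker-change token


def serialize_sot(utterances):
    """Build one serialized target from (start_time, text) pairs.

    Utterances are ordered by start time (an assumption here) and
    joined with the speaker-change token.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text for _, text in ordered)


def split_sot(serialized):
    """Recover per-utterance transcriptions from a serialized string."""
    return [segment.strip() for segment in serialized.split(SC)]


if __name__ == "__main__":
    utts = [(1.5, "fine thanks"), (0.0, "how are you")]
    target = serialize_sot(utts)
    print(target)
    print(split_sot(target))
```

A misplaced or missing `<sc>` in the hypothesis changes which utterance each character is assigned to after splitting, which is the kind of error that a boundary-sensitive metric is meant to expose.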