Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of the acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective based on the Bayes risk CTC framework. SACTC is a CTC variant tailored for multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition. The code is available at https://github.com/kjw11/Speaker-Aware-CTC.
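For context, the standard CTC objective that SACTC builds on sums the probability of all frame-level alignments of a label sequence (with interleaved blanks) via a forward dynamic program. The sketch below is a minimal, self-contained illustration of that standard forward algorithm in log space; it is not the proposed SACTC objective, and the function name `ctc_log_likelihood` is illustrative, not from the released code.

```python
import math

NEG_INF = float("-inf")

def _logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few scalars."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, target, blank=0):
    """Forward algorithm for CTC: log P(target | log_probs).

    log_probs: per-frame lists of log-probabilities over the vocabulary.
    target: label id sequence without blanks.
    """
    # Extend the target with blanks: ^ l1 ^ l2 ^ ... ^
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s] = log prob of all alignment prefixes ending at ext[s].
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            cand = [alpha[s]]                # stay on the same state
            if s >= 1:
                cand.append(alpha[s - 1])    # advance by one state
            # Skip a blank, unless the neighboring labels are identical.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[s - 2])
            new[s] = _logsumexp(*cand) + frame[ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    return _logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For a two-frame input with a uniform distribution over {blank, 'a'} and target "a", the three valid alignments (a·a, a·blank, blank·a) each have probability 0.25, so the total is 0.75. SACTC, as described in the abstract, modifies this objective through the Bayes risk CTC framework to penalize alignments that place a speaker's tokens outside that speaker's designated temporal region.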