We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to effectively leverage speaker time boundaries produced by the diarization module. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement (MIMO-SE) module within the DCF-DS framework, which offers further performance gains. Finally, we enhance the diarization results by re-clustering the DCF-DS outputs, improving ASR accuracy. With the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also evaluate on the open LibriCSS dataset, achieving new state-of-the-art performance in single-channel speech recognition.