Neural speech separation has made remarkable progress, and its integration with automatic speech recognition (ASR) is an important step towards multi-speaker ASR. This work investigates speech separation as an ASR front-end in reverberant and noisy-reverberant scenarios. Specifically, we explore multi-channel separation methods (mask-based beamforming and complex spectral mapping) as well as the best features to use in the ASR back-end model. We employ a recent self-supervised learning representation (SSLR) as the input feature, which improves recognition performance over filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming an existing integration of mask-based MVDR beamforming and filterbank features (28.9%).
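For readers less familiar with the metric, the word error rate figures quoted above can be illustrated with a minimal sketch. It assumes the standard definition (word-level edit distance divided by the number of reference words) and whitespace tokenization; the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and this is not the scoring pipeline used in this work, which may apply text normalization and multi-speaker handling not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline: float, proposed: float) -> float:
    """Fraction of the baseline WER removed by the proposed system."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Relative reduction between the two WERs quoted in the abstract.
    print(relative_wer_reduction(28.9, 2.5))
```

Under this definition, moving from 28.9% to 2.5% corresponds to a relative WER reduction of roughly 91%.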