In recent years, joint training of a speech enhancement front-end and an automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods feed only the enhanced speech to the back-end. However, because noise varies widely in type and intensity, it is difficult for a speech enhancement system to directly separate speech from the noisy input. Furthermore, enhanced speech often contains both speech distortion and residual noise, and the distortions affecting the speech and the noise differ. Most existing methods address this issue by fusing enhanced and noisy features. In this paper, we propose a dual-stream spectrogram refine network that simultaneously refines the speech and the noise and decouples the noise from the noisy input. The proposed method achieves better performance, with a relative character error rate (CER) reduction of 8.6%.
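To make the reported metric concrete, the following is a minimal sketch of how a CER and a relative CER reduction are typically computed. It is not the evaluation code of this work: the function names are hypothetical, and CER is assumed to be the character-level Levenshtein distance divided by the reference length.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction of a new system with respect to a baseline."""
    return (baseline_cer - new_cer) / baseline_cer
```

As a purely illustrative case (these numbers are not taken from the paper), `relative_reduction(0.10, 0.0914)` corresponds to a relative reduction of about 8.6%, that is, a baseline CER of 10.00% falling to 9.14%.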