In recent years, joint training of a speech enhancement front-end and an automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods feed only the enhanced speech to the back-end. However, because noise varies widely in type and intensity, it is difficult for a speech enhancement system to separate speech directly from the noisy input. Furthermore, enhanced speech often contains both speech distortion and residual noise, and the speech and the noise are distorted in different ways. Most existing methods address this issue by fusing enhanced and noisy features. In this paper, we propose a dual-stream spectrogram refine network that simultaneously refines the speech and the noise and decouples the noise from the noisy input. The proposed method achieves a relative character error rate (CER) reduction of 8.6%.
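A relative CER reduction is measured against a baseline CER, not in absolute percentage points. The following is a minimal sketch of that arithmetic. The function name and both CER values are hypothetical, chosen only so that the result comes out to 8.6%; the abstract does not report the underlying baseline or proposed CER values.

```python
def relative_cer_reduction(baseline_cer: float, proposed_cer: float) -> float:
    """Return the CER reduction as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - proposed_cer) / baseline_cer


# Hypothetical values for illustration only (not taken from the paper).
baseline_cer = 10.0   # baseline system CER, in percent
proposed_cer = 9.14   # proposed system CER, in percent

reduction = relative_cer_reduction(baseline_cer, proposed_cer)
print(f"Relative CER reduction: {reduction:.1%}")
```

With these made-up numbers, the absolute improvement is 0.86 percentage points, while the relative reduction is 8.6% of the baseline.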