Automatic speech recognition (ASR) systems degrade significantly under noisy conditions. Speech enhancement (SE) has recently been introduced as a front-end to reduce noise for ASR, but it also suppresses some important speech information, a problem known as over-suppression. To alleviate this, we propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR). Specifically, we first introduce the clean speech feature alongside the fused feature from IFF-Net as dual-path inputs to recover the suppressed information. We then propose style learning, which maps the fused feature close to the clean feature so that it learns latent speech information from the latter, i.e., the clean "speech style". Furthermore, we minimize the distance between the final ASR outputs of the two paths to improve noise robustness. Experiments show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline on the RATS and CHiME-4 datasets, respectively.
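The two auxiliary objectives described in the abstract, pulling the fused feature toward the clean feature and pulling the two paths' ASR outputs together, can be illustrated with a minimal NumPy sketch. The abstract does not specify the distance measures, the loss weights, or how the terms are combined, so everything here is an assumption made for illustration: mean squared error stands in for the style distance, KL divergence stands in for the output distance, and the names `style_distance`, `consistency_distance`, `dual_path_loss`, `w_style`, and `w_cons` are hypothetical.

```python
import numpy as np


def style_distance(fused, clean):
    """Distance between fused and clean feature maps.

    Assumption: mean squared error is used as a stand-in metric.
    """
    fused = np.asarray(fused, dtype=float)
    clean = np.asarray(clean, dtype=float)
    return float(np.mean((fused - clean) ** 2))


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def consistency_distance(logits_fused, logits_clean, eps=1e-12):
    """Distance between the ASR output distributions of the two paths.

    Assumption: KL(clean || fused), averaged over frames, is used as a
    stand-in metric. Inputs have shape (frames, vocabulary).
    """
    p = softmax(np.asarray(logits_clean, dtype=float))
    q = softmax(np.asarray(logits_fused, dtype=float))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))


def dual_path_loss(asr_loss, fused, clean, logits_fused, logits_clean,
                   w_style=1.0, w_cons=1.0):
    """Weighted sum of a precomputed ASR loss and the two auxiliary terms.

    Assumption: a simple weighted sum; the weights are hypothetical.
    """
    return (asr_loss
            + w_style * style_distance(fused, clean)
            + w_cons * consistency_distance(logits_fused, logits_clean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean_feat = rng.normal(size=(50, 80))       # (frames, feature dim)
    fused_feat = clean_feat + 0.1 * rng.normal(size=(50, 80))
    clean_logits = rng.normal(size=(50, 30))     # (frames, vocabulary)
    fused_logits = clean_logits + 0.1 * rng.normal(size=(50, 30))
    total = dual_path_loss(2.0, fused_feat, clean_feat,
                           fused_logits, clean_logits)
    print(f"total loss: {total:.4f}")
```

In this sketch both auxiliary terms vanish when the fused path exactly matches the clean path, so the total reduces to the ASR loss alone; any mismatch in features or outputs adds a non-negative penalty.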