Speech enhancement algorithms have been shown to improve the intelligibility of noisy speech. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions when compared with an ASR model trained directly on noisy speech. This divide between speech enhancement and ASR impedes progress on robust ASR systems, especially given the substantial advances speech enhancement has made in recent years. In this work, we focus on eliminating this divide with a time-domain enhancement model based on an attentive recurrent network (ARN). The proposed system fully decouples speech enhancement from the acoustic model, which is trained only on clean speech. Results on the CHiME-2 corpus show that ARN-enhanced speech translates to improved ASR results. The proposed system achieves an average word error rate of $6.28\%$, a $19.3\%$ relative improvement over the previous best.
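To make the closing figures concrete, the following is a minimal sketch of how a relative word error rate (WER) improvement is usually computed. The function names `relative_reduction` and `implied_baseline` are hypothetical and introduced only for illustration. The abstract does not state the previous best WER, so the baseline printed here is inferred from the two reported numbers under the assumption that "relative" means (baseline − new) / baseline; it is not a result taken from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, reduction: float) -> float:
    """Invert relative_reduction: recover the baseline WER from the new WER
    and the reported relative reduction (given as a fraction in [0, 1))."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return new_wer / (1.0 - reduction)


if __name__ == "__main__":
    # 6.28 and 0.193 are the figures reported in the abstract;
    # the baseline is an inference, not a number stated in the source.
    baseline = implied_baseline(6.28, 0.193)
    print(f"implied previous-best WER: {baseline:.2f}%")
    print(f"relative reduction: {relative_reduction(baseline, 6.28):.1%}")
```

Under this definition, the reported pair of numbers corresponds to a previous best of a little under 8% average WER, so the absolute gain is about 1.5 percentage points.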