Large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have recently reported state-of-the-art performance on various speech processing benchmarks. These systems intrinsically learn to handle noisy conditions and to remove noise from speech. Previous research has shown that the denoising capabilities of these models can be extracted into a preprocessor network, which can then be used as a frontend for downstream ASR models. However, the proposed methods were limited to specific fully convolutional architectures. In this work, we propose a novel method for extracting these denoising capabilities that can be applied to any encoder-decoder architecture. We propose the Cleancoder preprocessor architecture, which extracts hidden activations from the Conformer ASR model and feeds them to a decoder to predict denoised spectrograms. We train our preprocessor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs. We then evaluate our model both as a frontend to a pretrained Conformer ASR model and as a frontend for training smaller Conformer ASR models from scratch. We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions for both applications.
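For readers unfamiliar with the metric, the following is a minimal sketch of how a total WER over a test set can be computed. It assumes the standard definition (word-level edit distance divided by the number of reference words) and assumes that "total" means errors are pooled over all utterances before dividing. The function names `edit_distance` and `corpus_wer` are hypothetical, and the whitespace tokenization is an illustrative simplification; the exact scoring and text normalization used in this work are not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists.

    Counts the minimum number of substitutions, deletions, and insertions
    needed to turn `ref` into `hyp`.
    """
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Pooled WER: total word errors divided by total reference words.

    Assumption: transcripts are plain strings tokenized on whitespace.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_words += len(ref_tokens)
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_errors / total_words
```

Under these assumptions, comparing the downstream model's `corpus_wer` with and without the preprocessor as a frontend on the same noisy test set would quantify the improvement described above.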