Neural network approaches to single-channel speech enhancement have recently received considerable attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating at a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features an intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. The MSAE is also constructed entirely of differentiable operators, so it can be embedded within an end-to-end neural network and trained discriminatively. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show that the MSAE provides clear performance benefits relative to conventional single-branch autoencoders. Furthermore, the proposed framework outperforms a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
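To make the idea of Constant-Q band design with branch-specific rate and scale concrete, the following is a minimal sketch and not the MSAE's actual parameterization, which the abstract does not specify. It assumes geometrically spaced band edges, so that every band has the same ratio of center frequency to bandwidth, and it assumes each branch uses an analysis window spanning a fixed number of cycles of its center frequency. The function names `constant_q_bands` and `branch_scales`, and the parameters `cycles_per_window` and `overlap`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def constant_q_bands(f_min, f_max, num_bands):
    """Return geometrically spaced bands between f_min and f_max (Hz).

    Geometric spacing gives every band the same quality factor
    Q = center / bandwidth, the defining property of a Constant-Q layout.
    """
    edges = np.geomspace(f_min, f_max, num_bands + 1)
    lows, highs = edges[:-1], edges[1:]
    centers = np.sqrt(lows * highs)      # geometric center of each band
    q = centers / (highs - lows)         # identical for all bands
    return lows, highs, centers, q


def branch_scales(centers, sample_rate, cycles_per_window=8, overlap=0.5):
    """Assumed rule: each branch's window spans a fixed number of cycles.

    Low-frequency branches therefore get long windows and large hops
    (coarse scale, low rate); high-frequency branches get short windows
    and small hops (fine scale, high rate).
    """
    win = np.maximum(2, np.round(cycles_per_window * sample_rate / centers))
    hop = np.maximum(1, np.round(win * (1.0 - overlap)))
    return win.astype(int), hop.astype(int)


if __name__ == "__main__":
    lows, highs, centers, q = constant_q_bands(125.0, 8000.0, num_bands=6)
    win, hop = branch_scales(centers, sample_rate=16000)
    for lo, hi, w, h in zip(lows, highs, win, hop):
        print(f"band {lo:7.1f}-{hi:7.1f} Hz  window={w:4d}  hop={h:4d}")
```

With these example values, the six bands are one octave wide each, and the window length shrinks from the lowest band to the highest. In a trainable system, each branch would be realized with differentiable operators such as strided convolutions rather than fixed windows; that part is outside the scope of this sketch.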