In this paper, we propose to extend the deep complex U-Net architecture for speech enhancement by incorporating a probabilistic (i.e., variational) latent space model. The proposed model is evaluated against several ablated versions of itself in order to study the effects of the variational latent space model, complex-valued processing, and self-attention. Evaluation on the MS-DNS 2020 and Voicebank+Demand datasets yields consistently high performance. For example, the proposed model achieves an SI-SDR of up to 20.2 dB, about 0.5 to 1.4 dB higher than its ablated version without the probabilistic latent space, 2 to 2.4 dB higher than WaveUNet, and 6.7 dB higher than PHASEN. Compared to real-valued magnitude spectrogram processing with a variational U-Net, the complex U-Net achieves an improvement of up to 4.5 dB SI-SDR. Encoding the complex spectrum as magnitude and phase yields the best performance in anechoic conditions, whereas the real and imaginary part representation generalizes better to (novel) reverberation conditions, possibly due to the underlying physics of sound.
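To make the reported metric and the two spectrum encodings concrete, the following is a minimal sketch rather than the evaluation code behind the numbers above. It assumes the commonly used zero-mean definition of SI-SDR, and all function names (`si_sdr`, `to_real_imag`, `to_mag_phase`, `from_mag_phase`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Assumption: the common zero-mean definition, in which the reference is
    rescaled by the least-squares optimal factor before comparison.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Projection of the estimate onto the reference (the "target" part).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything in the estimate that is not explained by the target.
    residual = estimate - target
    ratio = (np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def to_real_imag(spec):
    """Encode a complex spectrum as two real channels: real and imaginary part."""
    spec = np.asarray(spec)
    return np.stack([spec.real, spec.imag], axis=0)


def to_mag_phase(spec):
    """Encode a complex spectrum as two real channels: magnitude and phase (radians)."""
    spec = np.asarray(spec)
    return np.stack([np.abs(spec), np.angle(spec)], axis=0)


def from_mag_phase(encoded):
    """Invert to_mag_phase: rebuild the complex spectrum from magnitude and phase."""
    return encoded[0] * np.exp(1j * encoded[1])
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves `si_sdr` essentially unchanged, which is what makes the metric suitable for comparing models whose output levels differ. Both encodings carry the same information; they differ only in how it is presented to the network.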