With advances in deep learning, neural-network-based speech enhancement (SE) has developed rapidly over the last decade. Meanwhile, self-supervised pre-trained models and vector quantization (VQ) have achieved excellent performance on many speech-related tasks, but they remain less explored for SE. Our previous work showed that using a VQ module to discretize noisy speech representations is beneficial for speech denoising; in this work, we therefore study the impact of applying VQ at different layers with different numbers of codebooks. Different VQ modules make it possible to extract speech features at multiple granularities. Through an attention mechanism, the contextual features extracted by a pre-trained model are fused with the local features extracted by the encoder, so that both global and local information is preserved for reconstructing the enhanced speech. Experimental results on the Valentini dataset show that the proposed model can improve SE performance, and they also reveal the impact of the choice of pre-trained model.
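The two operations named in the abstract, discretizing features with a VQ codebook and fusing contextual with local features through attention, can be illustrated with a minimal NumPy sketch. This is not the model's actual architecture: the function names `vector_quantize` and `attention_fuse`, the use of a single codebook with nearest-neighbor lookup, and the residual-style fusion are all assumptions made for illustration, and the sketch assumes both feature streams share the same feature dimension.

```python
import numpy as np


def vector_quantize(features, codebook):
    """Replace each frame (row) of `features` with its nearest codebook entry.

    features: (T, D) array of frame-level representations.
    codebook: (K, D) array of code vectors (hypothetical, not learned here).
    Returns the quantized (T, D) array and the (T,) array of chosen indices.
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_fuse(local_feats, context_feats):
    """Fuse local and contextual features with scaled dot-product attention.

    Local encoder features act as queries; contextual features (e.g. from a
    pre-trained model) act as keys and values. The attended context is added
    to the local features, an assumed residual-style fusion.
    local_feats: (T_local, D), context_feats: (T_context, D).
    """
    dim = local_feats.shape[-1]
    scores = local_feats @ context_feats.T / np.sqrt(dim)  # (T_local, T_context)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    attended = weights @ context_feats                     # (T_local, D)
    return local_feats + attended, weights
```

In this sketch, using several codebooks or applying `vector_quantize` at several encoder layers would correspond to the different granularities discussed above, although how the proposed model combines them is not specified here.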