Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent token streams, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines under a wide range of distortions, including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.
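To make the group vector quantization idea concrete, below is a minimal NumPy sketch of how separate per-group codebooks can yield mutually independent token streams, as the abstract describes. All names, shapes, and the nearest-codeword lookup are illustrative assumptions, not the paper's actual codec implementation.

```python
import numpy as np


def group_vector_quantize(latents, codebooks):
    """Quantize each group of latent dimensions with its own codebook.

    latents:   (T, G, D) array -- T frames, split into G groups of dim D
               (shapes are illustrative, not the paper's configuration)
    codebooks: list of G arrays, each (K, D)
    Returns a (T, G) array of token indices -- one independent stream
    per group, which is what permits parallel per-branch prediction.
    """
    T, G, _ = latents.shape
    tokens = np.empty((T, G), dtype=np.int64)
    for g, cb in enumerate(codebooks):
        # Nearest-codeword lookup (squared Euclidean distance) for group g.
        dist = ((latents[:, g, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        tokens[:, g] = dist.argmin(-1)
    return tokens


def decode_tokens(tokens, codebooks):
    """Map token indices back to quantized latents of shape (T, G, D)."""
    return np.stack([cb[tokens[:, g]] for g, cb in enumerate(codebooks)], axis=1)
```

Because each group is quantized against its own codebook, the G token streams carry no cross-group constraint at the quantizer level; a clean-token predictor can therefore run one branch per stream in parallel rather than decoding tokens serially.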