Diffusion-based speech enhancement (SE) has attracted recent attention, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the complementarity between predictive and diffusion-based SE. In this paper, we propose a unified system that integrates the two SE modules. The system encodes both generative and predictive information and then applies a generative decoder and a predictive decoder, whose outputs are fused. Specifically, the two SE modules are fused at the first and final diffusion steps: the first-step fusion initializes the diffusion process with the predictive SE output to improve convergence, and the final-step fusion combines the two complementary SE outputs to improve SE performance. Experiments on the Voice-Bank dataset show that the diffusion score estimation benefits from the predictive information and that decoding is accelerated.
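The two fusion points described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`predictive_estimate`, `toy_score`, `fused_enhance`), the moving-average stand-ins for the two decoders, the deterministic Euler sampler over a geometric noise schedule, and the convex combination used for the final fusion. The abstract does not specify the fusion formulas or the sampler, so this only shows where the two fusion steps would sit in a reverse diffusion loop.

```python
import numpy as np


def predictive_estimate(noisy):
    """Hypothetical stand-in for the predictive decoder: 5-tap moving average."""
    kernel = np.ones(5) / 5.0
    return np.convolve(noisy, kernel, mode="same")


def toy_score(x, noisy, sigma):
    """Hypothetical stand-in for the learned score model.

    Returns the score of a Gaussian N(target, sigma^2 I), where the target is
    a lightly smoothed copy of the noisy input. A real system would call a
    trained network here.
    """
    target = np.convolve(noisy, np.ones(3) / 3.0, mode="same")
    return (target - x) / sigma**2


def fused_enhance(noisy, num_steps=10, sigma_start=0.5, sigma_end=0.01,
                  fusion_weight=0.5, seed=0):
    """Toy reverse diffusion with fusion at the first and final steps."""
    rng = np.random.default_rng(seed)
    pred = predictive_estimate(noisy)
    sigmas = np.geomspace(sigma_start, sigma_end, num_steps + 1)

    # First-step fusion (assumed form): start the reverse process from the
    # predictive estimate plus noise, rather than from pure noise.
    x = pred + sigmas[0] * rng.standard_normal(noisy.shape)

    # Assumed sampler: deterministic Euler steps of a variance-exploding
    # probability-flow ODE, moving from sigma to the next smaller sigma.
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + 0.5 * (s**2 - s_next**2) * toy_score(x, noisy, s)

    # Final-step fusion (assumed form): convex combination of the predictive
    # output and the generative (diffusion) output.
    return fusion_weight * pred + (1.0 - fusion_weight) * x
```

Calling `fused_enhance(noisy)` on a one-dimensional NumPy array returns an array of the same shape. Setting `fusion_weight=1.0` reduces the output to the predictive estimate alone, and `fusion_weight=0.0` returns only the diffusion output, which makes the contribution of each branch easy to inspect in this toy setting.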