Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, this pipeline structure does not yet consider the combined use of generative and predictive decoders. A predictive decoder would allow us to further exploit the complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that jointly uses generative and predictive decoders across two levels. At the shared encoding level, the encoder encodes both generative and predictive information. At the decoded feature level, we fuse the features decoded by the generative and predictive decoders. Specifically, the two SE modules are fused at the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE output to improve convergence, and the final fusion combines the two complementary SE outputs to improve SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores than other score-based diffusion SE systems (StoRM and SGMSE+).
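The two fusion points described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `enhance`, the callables `predictive_decoder` and `reverse_step`, the noise scale `sigma`, and the mixing `weight` are all hypothetical names introduced here, and the final fusion is assumed to be a simple weighted average, which the abstract does not specify.

```python
import numpy as np


def enhance(noisy, predictive_decoder, reverse_step, num_steps, sigma, weight, rng):
    """Toy two-point fusion of a predictive and a generative SE branch.

    All arguments are illustrative assumptions, not the paper's API:
      predictive_decoder: callable mapping noisy features to an enhanced estimate
      reverse_step:       callable (x, noisy, step) -> x, one reverse diffusion step
      sigma:              noise scale added to the predictive estimate at the start
      weight:             mixing weight for the generative output in the final fusion
    """
    # Predictive branch: a single forward pass gives an enhanced estimate.
    pred = predictive_decoder(noisy)

    # Initial fusion: start reverse diffusion from the predictive estimate
    # plus Gaussian noise rather than from the noisy input.
    x = pred + sigma * rng.standard_normal(pred.shape)

    # Generative branch: run the (placeholder) reverse diffusion steps.
    for step in range(num_steps, 0, -1):
        x = reverse_step(x, noisy, step)

    # Final fusion (assumed here to be a weighted average of the two outputs).
    return weight * x + (1.0 - weight) * pred
```

With a real system, `predictive_decoder` and `reverse_step` would be trained networks sharing an encoder. Starting the reverse process from the predictive estimate is what allows fewer diffusion steps than starting from the noisy input alone.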