Bridge models have been investigated in speech enhancement, but they are mostly single-task, with limited general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one-step latent bridge model (LBM) for GSR, capable of efficiently reconstructing 48 kHz fullband speech from diverse distortions. To inherit the advantages of data-domain bridge models, we design an energy-preserving variational autoencoder that improves waveform-latent space alignment across varying energy levels. By compressing waveforms into continuous latent representations, VoiceBridge models~\textit{various} GSR tasks with a~\textit{single} latent-to-latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing a high-quality target from distinctively different low-quality priors, we propose a joint neural prior for GSR, uniformly reducing the burden on the LBM across diverse tasks. Building upon these designs, we further investigate the bridge training objective by jointly tuning the LBM, decoder, and discriminator, transforming the model from a denoiser into a generator and enabling \textit{one-step GSR without distillation}. Extensive validation across in-domain tasks (\textit{e.g.}, denoising and super-resolution), out-of-domain tasks (\textit{e.g.}, refining synthesized speech), and datasets demonstrates the superior performance of VoiceBridge. Demos: https://VoiceBridgedemo.github.io/.