Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that cascade speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE--SS methods struggle with complex degradations beyond additive noise. To this end, we propose \textbf{Geneses}, a generative framework that achieves unified, high-quality SE--SS. Geneses leverages latent flow matching to estimate each speaker's clean speech features using a multi-modal diffusion Transformer conditioned on self-supervised learning representations extracted from the noisy mixture. We conduct experimental evaluations using two-speaker mixtures from LibriTTS-R under two conditions: additive noise only and complex degradations. The results demonstrate that Geneses significantly outperforms a conventional mask-based SE--SS method across various objective metrics, with high robustness against complex degradations. Audio samples are available on our demo page.
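As background for the abstract above, a minimal sketch of the linear conditional flow matching objective that "latent flow matching" commonly refers to: sample a point on the straight-line path between a noise latent and a clean-speech latent, and regress the constant target velocity. All names, the toy latents, and the simplified linear path are illustrative assumptions, not the paper's implementation (which conditions a multi-modal diffusion Transformer on SSL features).

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear conditional flow matching path and its target velocity.

    x0: sample from the prior (noise latent).
    x1: clean-speech latent target (toy stand-in here).
    A model would be trained to predict v_t given (x_t, t, conditioning).
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    v_t = x1 - x0                   # constant target velocity along the path
    return x_t, v_t

def cfm_loss(pred_v, target_v):
    """Mean-squared regression loss on the predicted velocity field."""
    return float(np.mean((pred_v - target_v) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # hypothetical prior noise latents
x1 = rng.standard_normal((4, 8))   # hypothetical clean-speech latents
x_t, v_t = cfm_pair(x0, x1, t=0.5)
```

At inference time, the learned velocity field is integrated from noise toward the clean latent for each speaker; a perfect predictor drives the loss above to zero.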