Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.
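The abstract names SI-SDR regularization without defining it. As a point of reference, the following is a minimal NumPy sketch of the standard scale-invariant signal-to-distortion ratio. The function name `si_sdr`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for illustration; how the proposed framework weights or applies this term during training is not stated here.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove DC offset so the metric depends only on the waveform shape.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

When used as a training regularizer, the negative of this value is typically added to the main loss, so that higher SI-SDR lowers the total loss. Because the estimate is projected onto the reference, rescaling the estimate leaves the score unchanged.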