Target speaker extraction (TSE) aims to recover the speech signal of a desired speaker from a mixed audio recording, given a short enrollment utterance. Most existing TSE approaches are based on discriminative modeling paradigms. Although effective at suppressing interfering speakers, these methods often struggle to produce speech with high perceptual quality and naturalness. To address this limitation, we first propose LauraTSE, a generative TSE model built upon an auto-regressive decoder-only language model. However, purely generative approaches can suffer from hallucinations, content drift, and limited controllability, undermining their reliability in complex acoustic scenarios. To overcome these challenges, we further introduce a discriminative-generative TSE framework. In this framework, a discriminative front-end robustly extracts the target speaker's speech, yielding stable and controllable intermediate representations. A generative back-end then operates in the neural audio codec representation space to reconstruct fine-grained speech details and enhance perceptual quality. This two-stage design effectively combines the robustness and controllability of discriminative models with the superior naturalness and quality-enhancement capabilities of generative models. Moreover, we systematically investigate collaborative training strategies for the proposed framework, including freezing or fine-tuning the front-end, incorporating an auxiliary SI-SDR loss, and exploring both auto-regressive and non-auto-regressive inference mechanisms. Experimental results demonstrate that the proposed framework achieves a more favorable trade-off among speech quality, intelligibility, and speaker consistency.
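The abstract mentions an auxiliary SI-SDR (scale-invariant signal-to-distortion ratio) loss on the front-end output. As a point of reference, a minimal NumPy sketch of the standard SI-SDR metric and its negated form as a loss is given below; the function names and `eps` stabilizer are illustrative choices, not the paper's implementation.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better).

    The reference is rescaled by the optimal gain alpha so the metric
    is invariant to the overall amplitude of the estimate.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(noise, noise) + eps))

def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes the metric."""
    return -si_sdr(estimate, reference)
```

Because of the optimal-gain projection, scaling the estimate by any nonzero constant leaves the score unchanged, which is why SI-SDR is preferred over plain SNR when the separator's output gain is unconstrained.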