Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.
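The abstract names SI-SDR regularization as one of the collaboration strategies but does not say how the term enters the training objective. As a rough illustration only, the sketch below computes the standard scale-invariant signal-to-distortion ratio in plain Python and combines it with a generic back-end loss. The function `regularized_loss`, its `codec_loss` argument, and the `weight` value are hypothetical names chosen for this example and are not taken from the paper.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the optimally scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything in the estimate that is not explained by the target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def regularized_loss(codec_loss, estimate, reference, weight=0.1):
    """Hypothetical combined objective: a back-end loss minus weighted SI-SDR.

    Assumption for illustration: SI-SDR on the front-end waveform is
    maximized, so it is subtracted from the loss being minimized.
    """
    return codec_loss - weight * si_sdr(estimate, reference)


if __name__ == "__main__":
    reference = [math.sin(0.1 * i) for i in range(200)]
    estimate = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
    print("SI-SDR (dB):", si_sdr(estimate, reference))
    print("Combined loss:", regularized_loss(1.0, estimate, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why this metric suits a front-end whose output level is not constrained.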

