Target Speech Extraction with Pre-trained Self-supervised Learning Models

Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework: to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to use SSL models effectively for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.
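
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an SI-SDR improvement figure is commonly computed, using the standard scale-invariant SDR definition rather than the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the toy sinusoid signals are assumptions introduced here for illustration only.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the scaled target counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """Gain in SI-SDR of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): a target sinusoid plus an interfering one.
reference = [math.sin(0.1 * n) for n in range(200)]
interferer = [math.sin(0.37 * n + 1.0) for n in range(200)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend an extractor attenuated the interferer to 10% of its amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDR improvement: {improvement:.1f} dB")
```

The improvement is measured relative to the unprocessed mixture, so it reflects how much the extraction step itself contributes rather than how easy the mixture was to begin with.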

