Target speech extraction (TSE) isolates the speech of a specific speaker from a multi-talker overlapped speech mixture. Most existing TSE models rely on discriminative methods, typically predicting a time-frequency spectrogram mask for the target speech. However, imperfections in these masks often cause over-suppression of target speech or under-suppression of non-target speech, degrading perceptual quality. Generative methods, by contrast, re-synthesize the target speech conditioned on the mixture and target speaker cues, achieving superior perceptual quality. Nevertheless, these methods often overlook speech intelligibility, leading to alteration or loss of semantic content in the re-synthesized speech. Inspired by the Whisper model's success in target-speaker ASR, we propose a generative TSE framework built on the pre-trained Whisper model to address both issues. The framework integrates semantic modeling with flow-based acoustic modeling to achieve both high intelligibility and high perceptual quality. Results on multiple benchmarks demonstrate that the proposed method outperforms existing generative and discriminative baselines. Speech samples are available on our demo page.