Target speech extraction (TSE) isolates the speech of a specific speaker from a multi-talker overlapped speech mixture. Most existing TSE models rely on discriminative methods, typically predicting a time-frequency spectrogram mask for the target speech. However, imperfections in these masks often cause over-suppression of target speech or under-suppression of non-target speech, degrading perceptual quality. Generative methods, by contrast, re-synthesize the target speech conditioned on the mixture and target speaker cues, achieving superior perceptual quality. Nevertheless, these methods often overlook speech intelligibility, leading to alteration or loss of semantic content in the re-synthesized speech. Inspired by the Whisper model's success in target-speaker ASR, we propose a generative TSE framework built on the pre-trained Whisper model to address both issues. The framework integrates semantic modeling with flow-based acoustic modeling to achieve both high intelligibility and high perceptual quality. Results on multiple benchmarks demonstrate that the proposed method outperforms existing generative and discriminative baselines. Speech samples are available on our demo page.