Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture, as specified by an auxiliary reference. Most studies focus on the scenario where the target speech heavily overlaps with the interfering speech. However, this scenario accounts for only a small percentage of real-world conversations. In this paper, we address sparsely overlapped scenarios, in which the auxiliary reference must perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation, trained to discriminate speech-lip synchronization, can be used for speaker disentanglement. Experimental results show that our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
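
The abstract reports its gains in SI-SNR (scale-invariant signal-to-noise ratio) without defining it. As a rough illustration only, the sketch below follows the commonly used definition: both signals are made zero-mean, the estimate is projected onto the target, and the energy of that projection is compared with the energy of the residual. The function name `si_snr`, the `eps` stabilizer, and the toy signals are assumptions made for this example and are not taken from the paper, whose exact evaluation code may differ.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: the part explained by the target.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Everything not explained by the target is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


# Toy signals (assumed): a sine "target" plus two levels of interference.
target = [math.sin(0.1 * i) for i in range(200)]
clean_est = [s + 0.1 * math.cos(0.37 * i) for i, s in enumerate(target)]
noisy_est = [s + 0.5 * math.cos(0.37 * i) for i, s in enumerate(target)]

score_clean = si_snr(clean_est, target)
score_noisy = si_snr(noisy_est, target)
score_rescaled = si_snr([3.0 * x for x in clean_est], target)
```

Under these assumptions, `score_clean` exceeds `score_noisy`, and `score_rescaled` matches `score_clean` up to the small effect of `eps`, which is the scale invariance the metric is named for. An improvement "in terms of SI-SNR" is then the difference between two such scores computed against the same target.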