Audio-visual target speaker extraction (AV-TSE) aims to extract a specific speaker's speech from an audio mixture given auxiliary visual cues. Previous methods usually locate the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of the target speech, while ignoring variations in the noise characteristics. As a result, it may extract noisy signals from an incorrect sound source in challenging acoustic conditions. To this end, we propose a novel reverse selective auditory attention mechanism, which suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signals through this mechanism, we design an AV-TSE framework named the Subtraction-and-ExtrAction Network (SEANet) to suppress them. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and adopting nine evaluation metrics. The experimental results show that our proposed SEANet achieves state-of-the-art performance across all five datasets. We will release the code, models, and data logs.