Target speaker extraction (TSE) has achieved outstanding performance in certain speech enhancement and source separation scenarios. However, obtaining auxiliary speaker-related information remains challenging in noisy environments with significant reverberation. Inspired by recently proposed distance-based sound separation, we propose the near sound (NS) extractor, or NS-Extractor, which leverages distance information for TSE. It reliably extracts speaker information without requiring prior speaker enrollment, a mechanism we call speaker embedding self-enrollment (SESE). We introduce full-band and sub-band modeling to improve the NS-Extractor's adaptability to environments with significant reverberation. Results from several cross-dataset experiments demonstrate the effectiveness of these improvements and the excellent performance of the proposed NS-Extractor in different application scenarios.