Target speaker extraction (TSE) has achieved strong performance in certain speech enhancement and source separation scenarios. However, obtaining auxiliary speaker-related information remains challenging in noisy environments with significant reverberation. Inspired by recently proposed distance-based sound separation, we propose the near sound extractor (NS-Extractor), which leverages distance information for TSE to reliably extract speaker information without requiring prior speaker enrollment, an approach we call speaker embedding self-enrollment (SESE). Full-band and sub-band modeling is introduced to improve the NS-Extractor's adaptability to environments with significant reverberation. Cross-dataset experiments demonstrate the effectiveness of these improvements and the strong performance of the proposed NS-Extractor across different application scenarios.