Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and integrate it into the separation network to extract the target speaker's voice. However, this speaker-embedding representation is overly simplistic, often just a single 1×1024 vector, and such densely packed information is difficult for the separation network to exploit effectively. To address this limitation, we introduce a methodology called Hierarchical Representation (HR) that fuses anchor information into five layers of the separation network, spanning fine-grained to global levels, thereby enhancing the precision of target extraction. HR amplifies the effect of the anchor and improves target speaker isolation. On the Libri-2talker dataset, HR substantially outperforms state-of-the-art time-frequency domain techniques. Further demonstrating HR's capabilities, we achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge. The proposed HR methodology shows great promise for advancing target speaker extraction through enhanced anchor utilization.
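To illustrate the contrast the abstract draws, the sketch below conditions every layer of a toy separation network on the anchor embedding, rather than injecting the 1×1024 vector only once at the input. This is a minimal NumPy sketch under assumed dimensions and additive (bias-style) fusion; the layer count, widths, projection matrices, and nonlinearity are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: a 1x1024 anchor-derived speaker embedding and a
# separation network of three layers over 100 feature frames.
EMB_DIM, FEAT_DIM, FRAMES, N_LAYERS = 1024, 256, 100, 3

spk_emb = rng.standard_normal(EMB_DIM)        # anchor embedding (1x1024)
x = rng.standard_normal((FRAMES, FEAT_DIM))   # mixture features

# One projection per layer: the embedding is re-injected at every level
# (hierarchical fusion) instead of only at the network input.
proj = [rng.standard_normal((EMB_DIM, FEAT_DIM)) * 0.01 for _ in range(N_LAYERS)]
layer_w = [rng.standard_normal((FEAT_DIM, FEAT_DIM)) * 0.05 for _ in range(N_LAYERS)]

h = x
for W, P in zip(layer_w, proj):
    cond = spk_emb @ P        # map embedding to this layer's feature width
    h = np.tanh(h @ W + cond) # additive fusion of anchor information

print(h.shape)  # (100, 256): target-speaker-conditioned features
```

The design point is that each layer receives its own view of the anchor, so fine-grained early features and more global later features can both use speaker identity, instead of the network having to carry one dense vector forward unaided.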