Speech signals are inherently complex: they carry both global acoustic characteristics and local semantic information. In target speech extraction, however, some of the global and local semantic information in the reference speech is irrelevant to speaker identity and can cause speaker confusion in the speech extraction network. To address this, we propose a self-supervised disentangled representation learning method. The approach works in two phases, using a reference speech encoding network and a global information disentanglement network to progressively separate speaker identity information from irrelevant factors. Only the disentangled speaker identity information is used to guide the speech extraction network. We further introduce an adaptive modulation Transformer, which incorporates the speaker embedding as conditional information so that the acoustic representation of the mixed signal is not disturbed by it, providing natural and efficient guidance for the speech extraction network. Experimental results confirm the effectiveness of the approach, showing a substantial reduction in the likelihood of speaker confusion.
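The abstract does not spell out how the adaptive modulation Transformer injects the speaker embedding. One common way to use an embedding as conditional information is feature-wise modulation, where the embedding predicts a per-channel scale and shift that are applied to the normalized features of the mixture. The sketch below illustrates that general idea only; the class name `AdaptiveModulation`, the zero initialization, and all dimensions are assumptions made for illustration and are not taken from the paper.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, vec):
    """Compute weight @ vec + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


class AdaptiveModulation:
    """Hypothetical conditioning layer (an assumption, not the paper's design).

    The speaker embedding predicts a per-channel scale and shift that
    modulate the layer-normalized mixture features.
    """

    def __init__(self, feat_dim, emb_dim):
        # Zero-initialized projections: the layer starts as plain layer
        # normalization, so the mixture features are initially unchanged
        # by the speaker embedding.
        self.w_scale = [[0.0] * emb_dim for _ in range(feat_dim)]
        self.b_scale = [0.0] * feat_dim
        self.w_shift = [[0.0] * emb_dim for _ in range(feat_dim)]
        self.b_shift = [0.0] * feat_dim

    def __call__(self, frames, speaker_emb):
        # One scale and one shift per feature channel, shared by all frames.
        scale = linear(self.w_scale, self.b_scale, speaker_emb)
        shift = linear(self.w_shift, self.b_shift, speaker_emb)
        out = []
        for frame in frames:
            normed = layer_norm(frame)
            out.append([n * (1.0 + s) + t
                        for n, s, t in zip(normed, scale, shift)])
        return out


# Toy input: two frames of 4-dimensional mixture features and a
# 3-dimensional speaker embedding (all values are made up).
frames = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.5, -0.5, 0.25]]
speaker_emb = [0.2, -0.1, 0.4]

layer = AdaptiveModulation(feat_dim=4, emb_dim=3)
modulated = layer(frames, speaker_emb)
```

In this sketch the speaker embedding never gets concatenated with or added into the mixture representation directly; it only rescales and shifts normalized channels, which is one plausible reading of guidance that leaves the acoustic representation of the mixed signal undisturbed.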