Speech signals are inherently complex because they carry both global acoustic characteristics and local semantic information. In target speech extraction, however, certain global and local semantic information in the reference speech that is irrelevant to speaker identity can cause speaker confusion in the speech extraction network. To address this problem, we propose a self-supervised disentangled representation learning method. The method works in two phases, using a reference speech encoding network and a global information disentanglement network to progressively separate speaker identity information from irrelevant factors. Only the disentangled speaker identity information is used to guide the speech extraction network. We also introduce an adaptive modulation Transformer so that the acoustic representation of the mixed signal is not disturbed by the speaker embeddings. This component takes the speaker embeddings as conditional information, which guides the speech extraction network naturally and efficiently. Experimental results support the effectiveness of the approach, showing a substantial reduction in the likelihood of speaker confusion.
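The abstract does not specify how the adaptive modulation Transformer injects the speaker embedding, so the following is only a minimal sketch of one common way to use an embedding as conditional information: a FiLM-style scale and shift applied to normalized features, rather than adding or concatenating the embedding to the mixture representation. The function names, the projection matrices `w_scale` and `w_shift`, and all dimensions are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaptive_modulation(features, speaker_emb, w_scale, w_shift):
    """Condition mixture features on a speaker embedding (assumed design).

    features:    (T, D) acoustic representation of the mixed signal
    speaker_emb: (E,)   disentangled speaker identity embedding
    w_scale:     (E, D) hypothetical projection producing per-channel scale
    w_shift:     (E, D) hypothetical projection producing per-channel shift
    """
    scale = speaker_emb @ w_scale  # (D,)
    shift = speaker_emb @ w_shift  # (D,)
    # The embedding only modulates the normalized features; it is never
    # concatenated with or directly summed into the raw mixture features.
    return layer_norm(features) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
T, D, E = 50, 64, 32  # frames, feature channels, embedding size (illustrative)
features = rng.standard_normal((T, D))
speaker_emb = rng.standard_normal(E)

# Zero-initialized projections make the modulation an identity at the start,
# so the mixture representation passes through unchanged apart from the norm.
w_scale = np.zeros((E, D))
w_shift = np.zeros((E, D))
out = adaptive_modulation(features, speaker_emb, w_scale, w_shift)
```

With zero-initialized projections the output equals the plain layer-normalized features, which is one way such a conditioning layer could start from an undisturbed mixture representation and learn how strongly to apply the speaker information during training.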