Contrastive speaker embedding assumes that the contrast between positive and negative pairs of speech segments is attributable to speaker identity alone. This assumption does not hold, because speech signals carry linguistic content as well as speaker identity. In this paper, we propose a contrastive learning framework with sequential disentanglement that removes linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE disentangles speaker factors from content factors in an embedding space so that only the speaker factors are used to construct the contrastive loss. Because the content factors are excluded from the contrastive loss, the resulting speaker embeddings are content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR, which suggests that sequential disentanglement is beneficial for learning speaker-discriminative embeddings.
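As a rough illustration of the idea that only speaker factors enter the contrastive objective, the following minimal sketch computes a SimCLR-style NT-Xent loss over speaker factors and ignores content factors. It is not the paper's implementation: the DSVAE encoder and its own training terms are omitted, and the latent layout (a dict with hypothetical `"speaker"` and `"content"` keys), the function names, and the temperature value are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss over a batch of positive pairs (view_a[i], view_b[i])."""
    z = [l2_normalize(v) for v in view_a + view_b]
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other segment in the same pair
        # Similarities to every other embedding in the batch (self excluded).
        logits = [dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - dot(z[i], z[pos]) / temperature
    return total / (2 * n)


def speaker_contrastive_loss(latents_a, latents_b, temperature=0.5):
    """Contrastive loss built from speaker factors only.

    Each latent is a dict with hypothetical keys: "speaker" (one vector per
    segment) and "content" (one vector per frame). The content factors are
    deliberately never read here.
    """
    speakers_a = [lat["speaker"] for lat in latents_a]
    speakers_b = [lat["speaker"] for lat in latents_b]
    return nt_xent(speakers_a, speakers_b, temperature)


# Toy latents for two utterances, two segments each (made-up numbers).
latents_a = [
    {"speaker": [1.0, 0.0], "content": [[0.3, 0.1], [0.2, 0.9]]},
    {"speaker": [0.0, 1.0], "content": [[0.5, 0.5], [0.7, 0.1]]},
]
latents_b = [
    {"speaker": [0.9, 0.1], "content": [[0.8, 0.2], [0.1, 0.4]]},
    {"speaker": [0.1, 0.9], "content": [[0.6, 0.3], [0.9, 0.9]]},
]

loss = speaker_contrastive_loss(latents_a, latents_b)
```

Because the content factors never reach `nt_xent`, replacing them with arbitrary values leaves the loss unchanged, which is the sense in which the contrastive objective in this sketch is blind to linguistic content.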