Learning robust speaker representations under noisy conditions is challenging because the representations must be both discriminative and noise-invariant. In this work, we propose an anchor-based stage-wise learning strategy for robust speaker representation learning. Specifically, our approach first trains a base model to establish discriminative speaker boundaries and then extracts anchor embeddings from this model as stable references. Finally, a copy of the base model is fine-tuned on noisy inputs, with a regularizer that keeps each noisy embedding close to its corresponding fixed anchor embedding so that speaker identity is preserved under distortion. Experimental results suggest that this strategy offers advantages over conventional joint optimization, particularly in maintaining discrimination while improving noise robustness. The proposed method yields consistent improvements across various noise conditions, potentially because it handles boundary stabilization and variation suppression in separate stages.
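The fine-tuning objective described above can be illustrated with a minimal sketch. The abstract does not specify the proximity measure, the main training loss, or the weighting, so everything here is an assumption made for illustration: cosine distance as the proximity term, a softmax cross-entropy speaker-classification term, and the hypothetical names `stage2_loss` and `anchor_weight`. The anchor is assumed to be the embedding produced by the frozen base model for the corresponding clean input.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_hat, b_hat))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def stage2_loss(noisy_embedding, anchor_embedding, logits, label, anchor_weight=1.0):
    """Assumed fine-tuning objective: classification loss plus anchor proximity.

    anchor_embedding is treated as a constant (it comes from the frozen base
    model), so only the fine-tuned copy would receive gradients in practice.
    """
    proximity = cosine_distance(noisy_embedding, anchor_embedding)
    return cross_entropy(logits, label) + anchor_weight * proximity


# Toy usage: the same logits, but one noisy embedding has drifted from its anchor.
anchor = [1.0, 0.0]
aligned = [2.0, 0.0]   # same direction as the anchor, so no penalty
drifted = [0.0, 1.0]   # orthogonal to the anchor, so the penalty is anchor_weight
logits = [2.0, 0.5]

loss_aligned = stage2_loss(aligned, anchor, logits, label=0, anchor_weight=0.5)
loss_drifted = stage2_loss(drifted, anchor, logits, label=0, anchor_weight=0.5)
```

In this toy case the two losses differ only by the anchor term, which shows how the regularizer penalizes noise-induced drift without changing the classification term. A real system would compute the same quantities on batches with an automatic-differentiation framework.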