In real-world applications, speaker recognition models often face domain mismatch, which leads to a significant drop in performance. Although numerous domain adaptation techniques have been developed to address this issue, almost all existing methods focus on a simple configuration in which the model is trained in one domain and deployed in another. Real-world environments, however, are often complex and may contain multiple domains, which makes methods designed for one-to-one adaptation suboptimal. In this paper, we propose a self-supervised learning method to tackle this multi-domain adaptation problem. Building upon the basic self-supervised adaptation algorithm, we design three strategies to make it suitable for multi-domain adaptation: an in-domain negative sampling strategy, a MoCo-like memory bank scheme, and a CORAL-like distribution alignment. We conduct experiments using VoxCeleb2 as the source-domain dataset and CN-Celeb1 as the multi-domain target dataset. The results show that our method clearly outperforms the basic self-supervised adaptation method, which simply treats the CN-Celeb1 data as a single domain. Importantly, the improvement is consistent across nearly all in-domain and cross-domain tests, demonstrating the effectiveness of the proposed method.
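To make two of the three strategies more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that the CORAL-like alignment resembles the standard Deep CORAL loss (squared Frobenius distance between feature covariances) and that in-domain negative sampling can be approximated by masking out cross-domain negatives in an InfoNCE loss. The function names, the `temperature` value, and the integer `domains` labels are all hypothetical choices made for illustration; the MoCo-like memory bank is omitted.

```python
import numpy as np


def coral_loss(source, target):
    """Deep-CORAL-style loss (assumed form): squared Frobenius distance
    between the covariance matrices of two batches of embeddings."""
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False)  # (d, d) covariance of source batch
    cov_t = np.cov(target, rowvar=False)  # (d, d) covariance of target batch
    return np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d)


def in_domain_info_nce(anchors, positives, domains, temperature=0.1):
    """InfoNCE loss in which the negatives of each anchor are restricted
    to samples that carry the same (hypothetical) domain label."""
    # Cosine similarity between every anchor and every candidate.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature

    # Keep only same-domain pairs; the diagonal (the positive pair) is
    # always kept because a sample shares a domain with itself.
    same_domain = domains[:, None] == domains[None, :]
    logits = np.where(same_domain, logits, -np.inf)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(16, 8))                   # first augmented view
    emb_b = emb_a + 0.1 * rng.normal(size=(16, 8))     # second augmented view
    dom = np.array([0] * 8 + [1] * 8)                  # two toy domains
    print("in-domain InfoNCE:", in_domain_info_nce(emb_a, emb_b, dom))
    print("CORAL between domains:", coral_loss(emb_a[:8], emb_a[8:]))
```

Under these assumptions, masking cross-domain pairs keeps the contrastive objective from being satisfied by domain cues alone, while the covariance term pulls the second-order statistics of different domains closer together.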