Multi-genre speaker recognition is becoming increasingly popular because it better represents the complexities of real-world applications. A major challenge, however, is the significant shift in the distribution of speaker vectors across genres. Distribution alignment is a common way to address this shift, but previous studies have mainly focused on aligning one source domain with one target domain, and the performance of these methods on multi-genre data is unknown. This paper presents a comprehensive study of mainstream distribution alignment methods on multi-genre data, where multiple distributions need to be aligned. We analyze the methods both qualitatively and quantitatively. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively better than the other methods. However, none of the investigated methods consistently improves performance across all test cases. This suggests that aligning the distributions of speaker vectors alone may not fully address the challenges posed by multi-genre speaker recognition, and further investigation is needed to develop a more comprehensive solution.
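To make the idea of aligning multiple distributions concrete, the following is a minimal sketch of one generic approach: whitening the speaker vectors of each genre separately so that every genre ends up with zero mean and approximately identity covariance. It is an illustration only, not WBDA and not necessarily any of the methods evaluated in the paper. The function name `align_genres`, the `eps` regularizer, and the input layout (one row per speaker vector, one genre label per row) are assumptions made for this example.

```python
import numpy as np


def align_genres(vectors, genres, eps=1e-6):
    """Whiten each genre's vectors separately (hypothetical helper).

    After the transform, every genre has zero mean and approximately
    identity covariance, so all genre distributions share the same
    first- and second-order statistics.
    """
    vectors = np.asarray(vectors, dtype=float)
    genres = np.asarray(genres)
    aligned = np.empty_like(vectors)

    for g in np.unique(genres):
        mask = genres == g
        x = vectors[mask]  # needs at least two vectors per genre

        # Remove the genre-specific mean.
        centered = x - x.mean(axis=0)

        # Genre-specific covariance, regularized for numerical stability.
        cov = np.cov(centered, rowvar=False) + eps * np.eye(x.shape[1])

        # Inverse square root of the covariance via eigendecomposition.
        eigval, eigvec = np.linalg.eigh(cov)
        whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

        aligned[mask] = centered @ whitener

    return aligned
```

This kind of per-genre normalization only matches means and covariances. It does not guarantee that the same speaker maps to nearby vectors in different genres, which is consistent with the abstract's observation that distribution alignment alone may not fully solve the multi-genre problem.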